## Programming of Data Analysis Project 1

**Francesco Troja**

***

Project 1

>Create a data set by simulating a real-world phenomenon of your choosing. Then rather than collect data related to the phenomenon, you should model and synthesise such data using Python. We suggest you use the numpy.random package for this purpose. Specifically, in this project you should:
>- Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.
>- Investigate the types of variables involved, their likely distributions, and their relationships with each other.
>- Synthesise/simulate a data set as closely matching their properties as possible.
>- Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.


#### Installations

To execute this project, several Python libraries have been utilized. These libraries were chosen for their specific functionalities and capabilities, tailored to the requirements of the project:
1. `pandas`: The library's powerful data structures, including DataFrames and Series, allow efficient organization and structuring of data, making it easy to perform data operations such as filtering, grouping, and aggregating. Pandas also offers a wide range of functions for data cleaning and preparation, making it well suited to real-world data challenges[1].
2. `matplotlib.pyplot`: It is a widely used library for data visualization in Python. It provides a flexible and comprehensive set of tools for creating bar charts, line plots, scatter plots, histograms, and more, making it an essential tool for exploratory data analysis and presentation of findings[2].
3. `numpy`: It is imported in this context for its extensive capabilities in numerical and statistical operations. Numpy provides a wide range of probability distributions, functions for generating random numbers following these distributions, and tools for statistical calculations[3]. 

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
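
Since the brief suggests `numpy.random` for synthesising data, the following minimal sketch shows how its `Generator` interface can draw samples from named distributions. The seed, the distribution parameters, the county names and the variable names are illustrative assumptions, not values derived from the dataset.

```python
import numpy as np

# Seeded generator so the draws are reproducible (the seed value is arbitrary)
rng = np.random.default_rng(seed=42)

# Hypothetical parameters, chosen only for illustration
n_samples = 100

# Log-normal draws are strictly positive and right-skewed
prices = rng.lognormal(mean=12.4, sigma=0.6, size=n_samples)

# Categorical draws with assumed (made-up) probabilities
counties = rng.choice(["Dublin", "Cork", "Galway"], size=n_samples, p=[0.5, 0.3, 0.2])

print(prices[:5].round(2))
print(counties[:5])
```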

#### Importing the Dataset

The provided dataset offers an extensive and detailed record of property house prices in Ireland, spanning the period from 2010 to 2023. It plays a significant role in capturing essential information about the real estate market in Ireland. This dataset provides valuable insights into the dynamics of property prices, trends, and fluctuations over a thirteen-year period. The housing market is a crucial element of a country's economy, and property prices reflect not only home values but also broader economic conditions, as well as the forces of supply and demand at play.

The dataset includes a wide range of variables, including the date of sale, property location, property type, and the sale price. These variables offer a rich source of information for analysis, allowing for the examination of various aspects of the property market, such as regional variations, property types, and the impact of economic events on house prices.

The dataset used was found on the [Kaggle](https://www.kaggle.com/datasets/raphaelmapp/ireland-house-prices-2010-to-2023/code) website.

In Python, CSV files are commonly read with the `read_csv()` function from the Pandas library. The file path, which specifies the location of the CSV file, is passed as a parameter; `read_csv()` then reads the data from that file and converts it into a Pandas DataFrame. The DataFrame represents the data in a structured, tabular format that enables straightforward exploration, manipulation, and analysis[2].

In [None]:

df = pd.read_csv("Housing_Data_Jan2010_to_May2023_Cleaned.csv")

print(f'The dataset used is:\n {df}')
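
As a small self-contained illustration of `read_csv()`, the sketch below reads CSV text from an in-memory buffer instead of a file on disk. The three columns and their values are made up for the example and are not rows from the real dataset.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """Date,County,Price
01/01/2010,Dublin,343000.0
03/01/2010,Cork,185000.0
"""

# read_csv accepts a path or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)   # (rows, columns)
print(toy.dtypes)  # one inferred dtype per column
```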

#### Data Exploration

Let's investigate the dataset's structure and characteristics. Statistical analysis is a method for uncovering patterns and correlations in data. The goal is to provide a descriptive overview of the dataset and its variables. Let's have a look at the dataset's contents:

- The Pandas `head()` method is used to return the top n (default is 5) rows from a dataset.
- The Pandas `tail()` method is used to return the bottom n (default is 5) rows from a dataset[3].


In [None]:
print("the first 5 rows of the dataset:\n")
df.head()


In [None]:
print("the last 5 rows of the dataset:\n")
df.tail()

As the output above shows, the selected dataset comprises 597,527 rows and 8 columns. This initial look also shows that the dataset contains missing values. The dimensions of the dataset can be confirmed with the Pandas `shape` attribute, which returns a tuple whose first element is the number of rows (observations) and whose second element is the number of columns (variables)[4].

In [None]:
print('The dimensions of the dataset are:\n')
df.shape

To gain further insights into the DataFrame, the `info()` function can be used. This function provides metadata about the DataFrame, including the column names, the count of non-null values in each column, and the data type for each column[5]:

In [None]:
print('Find below the full summary of the Dataset:\n')
df.info()

The analysis of the dataset reveals the following key findings:

1. The dataset consists of 8 columns:

    - Date
    - Address
    - County
    - Price
    - Full_market_price
    - VAT_Exclusive
    - Description_of_Property
    - Property_Size_Description

2. It is apparent that there are missing values present in some of the columns.
3. The dataset is composed of a mix of data types. Specifically, there are 5 columns with the object data type (strings of text, or mixed numeric and non-numeric values) and 3 numerical columns (floating-point numbers)[6].

The analysis highlights an issue with the "Date" column being stored as an object data type, which limits its utility for datetime operations. To resolve this, the "Date" column needs to be converted to the datetime64 data type. The Pandas `to_datetime()` function serves this purpose, transforming the column into a format that enables effective datetime operations on the dataset[7].


In [None]:
# Convert Date to datetime; day-first strings are expected and unparseable values become NaT
df["Date"] = pd.to_datetime(df["Date"],dayfirst=True, errors='coerce')
df["Date"].info()
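
To make the effect of `dayfirst=True` and `errors='coerce'` concrete, here is a minimal sketch on a toy Series. The date strings are invented for the example.

```python
import pandas as pd

# Toy day-first date strings; the last entry is deliberately invalid
raw_dates = pd.Series(["01/02/2010", "15/03/2015", "not a date"])

parsed = pd.to_datetime(raw_dates, dayfirst=True, errors="coerce")

print(parsed)
print(parsed[0].day, parsed[0].month)  # day-first: "01/02/2010" is 1 February
print(parsed.isna().sum())             # the invalid string was coerced to NaT
```

With `errors='coerce'`, bad values do not raise an exception, so it is worth checking how many `NaT` entries the conversion produced.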

As the earlier `tail()` output showed, the dataset contains entire rows of null values. To find the rows with null values, the `isnull()` function can be used: it flags the presence of missing or null values in the dataset. Knowing how many such rows exist is essential before proceeding with further analysis and informs how they should be handled[8].

In [None]:
# Use the Date variable as the reference column for locating null rows
df[df['Date'].isnull()]

Upon analysis, it's evident that the total number of rows containing null values amounts to 873. Given that these rows do not contain any valuable information, the next step involves removing them from the dataset. The function `drop()` is used for this purpose, which eliminates rows based on their index values. Typically, the index value represents a 0-based integer value assigned to each row. By specifying the row index, it can be deleted from the dataset[9].

In [None]:
# Drop the trailing block of empty rows by position; inplace=True modifies df directly
df.drop(df.index[596655:597528], inplace=True)
df.tail()
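
The hard-coded index range above depends on the exact row count of this particular file. As an alternative sketch, assuming the unwanted rows are empty in every column, `dropna(how="all")` removes them without reference to row positions. The toy frame below is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame: the last two rows are empty in every column
toy = pd.DataFrame({
    "County": ["Dublin", "Cork", np.nan, np.nan],
    "Price": [250000.0, np.nan, np.nan, np.nan],
})

# how="all" drops a row only when every column in it is missing,
# so the partially filled "Cork" row is kept
cleaned = toy.dropna(how="all")

print(len(toy), "->", len(cleaned))
```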

The analysis now shifts to the missing values within the variables of the dataset. To determine how many missing values exist for each variable, the `sum()` function can be chained onto `isnull()`[10].

In [None]:
print('The missing values are:\n')
df.isnull().sum()

The result reveals that, of the 8 variables in the dataset, 5 contain missing values. Specifically, the "Date" variable has 1 missing value, while the "Full_Market_Price", "VAT_Exclusive", and "Description_of_Property" variables each have 11 missing values. How missing values should be handled depends on the specific analysis or task at hand, so identifying and understanding the nature of the missing data is a critical part of data preparation. Depending on the goals of the analysis, options include imputation (filling in missing values with estimated or calculated data) or data cleaning techniques that ensure the dataset's quality and suitability for the intended analysis[11].
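
The two options mentioned above can be sketched on a toy frame as follows. The values are invented, and the choice of the median as the fill value is an assumption made for illustration rather than a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Price": [200000.0, np.nan, 350000.0, 275000.0],
    "Description_of_Property": ["New", "Second-Hand", None, "New"],
})

# Option 1: impute a numeric column with its median
median_price = toy["Price"].median()
toy["Price_filled"] = toy["Price"].fillna(median_price)

# Option 2: drop rows that lack a value in a column the analysis depends on
complete = toy.dropna(subset=["Description_of_Property"])

print(toy["Price_filled"].tolist())
print(len(complete))
```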

#### Statistical information

In the process of analyzing a dataset, a crucial initial step involves determining the type of variable associated with each attribute. One fundamental property of variables is their level or scale of measurement, which dictates the permissible arithmetic operations and, consequently, specifies the applicable statistical tests. In statistics, there are four primary levels of measurement: **nominal**, **ordinal**, **interval**, and **ratio**. These levels are hierarchical, with each level possessing all the characteristics of the previous levels, and some additional features[12].

- Nominal Scale: This is the lowest level of measurement, indicating that variables possess distinct values, but no meaningful order can be established among them. When there are only two categories, such as gender, it is referred to as dichotomous or binary.
- Ordinal Scale: Positioned one level higher, the ordinal scale encompasses nominal information but allows for the establishment of a ranking. However, the distances between values are not interpretable, making it impossible to quantify the absolute distance between two values[13].

Variables with a nominal or ordinal scale are often termed categorical variables, while variables with an interval or ratio scale are numerical variables, often loosely called continuous[13].

- Ordinal Variables: These categorize information with a clear sense of order or ranking. However, it's important to emphasize that the intervals or gaps between these categories are not uniform or quantifiable. For example, consider customer satisfaction ratings such as "poor," "fair," "good," and "excellent." While these categories can be ranked, the differences between them are not consistent and cannot be precisely measured. Ordinal scales allow for the establishment of a ranking, indicating higher or lower positions, but they do not provide a basis for making detailed numerical comparisons[14].
- Interval Variables: This category permits the application of a wide array of statistical measures. However, it's essential to note that these measures cannot assume the existence of a 'true' zero point. On an interval scale, the zero point is a matter of convention rather than an absolute marker. For instance, Centigrade and Fahrenheit temperature scales both exhibit equal intervals of temperature defined by considering equal volumes of expansion. Yet, each scale establishes an arbitrary zero point, and numerical values from one scale can be translated into equivalent values on the other using a specific mathematical equation. The critical idea is that interval variables maintain their properties regardless of the choice of the zero point, as long as consistent transformations are applied[15].
- Ratio Variables: Representing the highest level of precision among all scales, ratio data is a subset of quantitative data. Unlike interval data, ratio data possesses a distinctive attribute: the presence of a "true zero." A zero measurement on a ratio scale is absolute, signifying that ratio data can never be negative. This characteristic enables the full range of mathematical operations, including addition, subtraction, multiplication, and division, during statistical analyses[16].
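
As a rough illustration of how an ordinal scale can be represented in Pandas, the sketch below encodes the satisfaction ratings from the example above as an ordered categorical. The sample values are made up.

```python
import pandas as pd

ratings = pd.Series(["good", "poor", "excellent", "fair", "good"])

# ordered=True declares a ranking: poor < fair < good < excellent
levels = ["poor", "fair", "good", "excellent"]
ordinal = pd.Series(pd.Categorical(ratings, categories=levels, ordered=True))

# Ranking-based operations are meaningful on an ordinal scale
print(ordinal.min(), ordinal.max())

# Comparisons respect the declared order
print((ordinal >= "good").sum())
```

Arithmetic such as a mean is not defined on this representation, which mirrors the point that the distances between ordinal categories are not interpretable.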



#### Categorical Variables

For categorical data, a common summary measure is the count of observations for a specific category or percentage that each category contributes to the entire dataset. To visually represent this information, a frequency table can be utilized, often accompanied by a bar chart or pie chart. A frequency table displays the occurrence of each unique value within a column, providing both tabular and graphical representations[17].

To identify the distinct values of categorical variables within the dataset, the Pandas `unique()` function can be employed. This function returns an array containing the unique values found in a specified column. For instance[18]:

In [None]:
# create a list that includes all the categorical variables
cat_var = ['Address', 'County', 'Description_of_Property', 'Property_Size_Description', ]

for variable in cat_var:
    unique_value = df[variable].unique()
    print(f'\nUnique {variable} in the dataset:\n', unique_value)

The output above reveals that the "Description_of_Property" and "Property_Size_Description" attributes contain some garbled text. Further analysis, aided by the English-Irish dictionary ([focloir](https://www.focloir.ie/en/dictionary/ei/dwelling+house)), indicates that this text is the Irish (Gaelic) version of the English labels, with its accented characters replaced by `?` or `�`. To make the dataset clearer for analysis, the Irish entries are replaced with their English equivalents using the `replace()` function[19].

In [None]:
# create a dictionary of substrings as key-value pairs
char_to_replace = {
    "n?os l? n? 38 m?adar cearnach": "less than 38 sq metres",
    # "Nua" is Irish for "new"; the "Ath...imhe" label is the second-hand one
    "Teach/?ras?n C?naithe Nua": "New Dwelling house /Apartment",
    "Teach/�ras�n C�naithe Nua": "New Dwelling house /Apartment",
    "Teach/�ras�n C�naithe Ath�imhe": "Second-Hand Dwelling house /Apartment",
    "n�os m� n� n� cothrom le 38 m�adar cearnach agus n�os l� n� 125 m�adar cearnach": 'greater than or equal to 38 sq metres and less than 125 sq metres'
}

for old_text, new_text in char_to_replace.items():
    df['Property_Size_Description'] = df['Property_Size_Description'].replace(old_text, new_text)
    df['Description_of_Property'] = df['Description_of_Property'].replace(old_text, new_text)
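
Note that `Series.replace()` with plain strings matches whole cell values, not substrings. As a compact sketch of the same idea, a dictionary can also be passed to `replace()` directly, which avoids the explicit loop. The labels below are placeholders, not values from the dataset.

```python
import pandas as pd

sizes = pd.Series(["small", "medium", "small", "large"])

# Keys are the exact cell values to replace; values are their replacements
mapping = {
    "small": "less than 38 sq metres",
    "medium": "38 to 125 sq metres",
}

# Values without a matching key ("large") are left unchanged
relabelled = sizes.replace(mapping)

print(relabelled.tolist())
```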

Let's move on with the analysis. To obtain the count of each unique value of a categorical variable, the function [`value_counts()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) can be used. This function returns a Series containing the counts of unique values in descending order.

In [None]:
cat_var = ['County', 'Address', 'Description_of_Property', 'Property_Size_Description']

for variable in cat_var:
    count = df[variable].value_counts()
    print(f'\nValue counts for column {variable}:\n {count}')
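
The counts can also be expressed as shares of the whole, which is what a frequency table usually reports. A minimal sketch, using an invented County column:

```python
import pandas as pd

counties = pd.Series(
    ["Dublin", "Cork", "Dublin", "Galway", "Dublin", "Cork"], name="County"
)

# normalize=True returns proportions instead of raw counts
freq_table = pd.DataFrame({
    "count": counties.value_counts(),
    "percent": counties.value_counts(normalize=True).mul(100).round(1),
})

print(freq_table)
```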

Now, let's proceed to visualize the distribution of categorical data by plotting a pie chart. This graphical representation will illustrate the percentage contribution of each category to the overall dataset.

In [None]:
county_value = df.County.value_counts()
desc_prop_value = df.Description_of_Property.value_counts()
size_prop_value = df.Property_Size_Description.value_counts()

fig = plt.figure(figsize=(50, 25))  # figure 50 inches wide, 25 inches tall

ax1 = plt.subplot(131)  # 1 row, 3 columns, 1st subplot
ax2 = plt.subplot(132)  # 1 row, 3 columns, 2nd subplot
ax3 = plt.subplot(133)  # 1 row, 3 columns, 3rd subplot

# Series.plot draws the pie from the values and labels slices from the index,
# so no x/y arguments are needed
county_value.plot(kind='pie', autopct='%1.1f%%', ax=ax1)
desc_prop_value.plot(kind='pie', autopct='%1.1f%%', ax=ax2)
size_prop_value.plot(kind='pie', autopct='%1.1f%%', ax=ax3)
ax1.set_title('Counties',  fontsize=40)
ax2.set_title('Property Description',  fontsize=40)
ax3.set_title('Property size', fontsize=40)
plt.savefig("percentage of categorical variable")
plt.show()


### References

[1]: Chugh V., (2023). "*Python pandas tutorial: The ultimate guide for beginners*". [Datacamp](https://www.datacamp.com/tutorial/pandas)

[2]: matplotlib, (n.d.). "*matplotlib.pyplot*". [matplotlib](https://matplotlib.org/3.5.3/api/_as_gen/matplotlib.pyplot.html)

[2]: Analyseup, (n.d.). "*Importing Data with Pandas*". [Analyseup](https://www.analyseup.com/learn-python-for-data-science/python-pandas-importing-data.html)

[3]: Shazra H., (2023). "*head () and tail () Functions Explained with Examples and Codes*". [Analytics Vidhya](https://www.analyticsvidhya.com/blog/2023/07/head-and-tail-functions/)

[4]: Pandas, (n.d.). "*pandas.DataFrame.shape*". [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html)

[5]: Rajan S., (2023). "*Python | Pandas dataframe.info()*". [geeksforgeeks](https://www.geeksforgeeks.org/python-pandas-dataframe-info/)

[6]: Moffitt C., (2018). "*Overview of Pandas Data Types*". [Practical Business Python](https://pbpython.com/pandas_dtypes.html#:~:text=An%20object%20is%20a%20string,df)

[7]: stackoverflow, (2014). "*Convert Pandas Column to DateTime*". [stackoverflow](https://stackoverflow.com/questions/26763344/convert-pandas-column-to-datetime)

[8]: Data to Fish, (2021). "*Select all Rows with NaN Values in Pandas DataFrame*". [Data to Fish](https://datatofish.com/rows-with-nan-pandas-dataframe/)

[9]: Lynn S., (n.d.). "*Delete Rows & Columns in DataFrames Quickly using Pandas Drop*". [Shane Lynn](https://www.shanelynn.ie/pandas-drop-delete-dataframe-rows-columns/)

[10]: Welck Aj, (n.d.). "*How to Check If Any Value is NaN in a Pandas DataFrame*". [Chartio](https://chartio.com/resources/tutorials/how-to-check-if-any-value-is-nan-in-a-pandas-dataframe/)

[11]: Shashank S., (2023). "*Defining, Analysing, and Implementing Imputation Techniques*". [Analytics Vidhya](https://www.analyticsvidhya.com/blog/2021/06/defining-analysing-and-implementing-imputation-techniques/)

[12]: Kirch, Wilhelm, ed. (2008). "*Level of Measurement*". Encyclopedia of Public Health. [Springer Link](https://link.springer.com/referenceworkentry/10.1007/978-1-4020-5614-7_1971)

[13]: DATAtab Team (2023). "*Level of measurement*". [DATAtab: Online Statistics Calculator](https://datatab.net/tutorial/level-of-measurement)

[14]: GraphPad, (n.d.). "*What is the difference between ordinal, interval and ratio variables? Why should I care?*". [GraphPad](https://www.graphpad.com/support/faq/what-is-the-difference-between-ordinal-interval-and-ratio-variables-why-should-i-care/)

[15]: Stevens S.S., (1946). "*On the Theory of Scales of Measurement*". Science, Vol. 103, No. 2684

[16]: Bhat A., (n.d.). "*Levels of Measurement: Nominal, Ordinal, Interval & Ratio*". [QuestionPro](https://www.questionpro.com/blog/nominal-ordinal-interval-ratio/)

[17]: Statgraphics19, (n.d.). "*Categorical Data Analysis*". [Statgraphics19](https://www.statgraphics.com/categorical-data-analysis#:~:text=The%20Frequency%20Tables%20procedure%20analyzes,a%20set%20of%20multinomial%20probabilities.)

[18]: Ebner J., (2020). "*How to Use Pandas Unique to Get Unique Values*". [Sharp Sight](https://www.sharpsightlabs.com/blog/pandas-unique/)

[19]: yuktijain, (n.d.). "*How to replace multiple substrings of a string in Python?*". [Study Tonight](https://www.studytonight.com/python-howtos/how-to-replace-multiple-substrings-of-a-string-in-python)

### Additional readings

- Stackoverflow, (2020). "*Python how to fix year out of range error*". [Stackoverflow](https://stackoverflow.com/questions/62130640/python-how-to-fix-year-out-of-range-error)
- Stackoverflow, (2017). "*Understanding inplace=True in pandas*". [Stackoverflow](https://stackoverflow.com/questions/43893457/understanding-inplace-true-in-pandas)
- matplotlib, (n.d.). "*matplotlib.pyplot.pie*". [matplotlib](https://matplotlib.org/3.5.3/api/_as_gen/matplotlib.pyplot.pie.html)
- Amipara K., (2017). "*Better visualization of Pie charts by MatPlotLib*". [Medium](https://medium.com/@kvnamipara/a-better-visualisation-of-pie-charts-by-matplotlib-935b7667d77f)
- Stackoverflow, (2020). "*How to plot 3 plots simultaneously in one plot?*". [Stackoverflow](https://stackoverflow.com/questions/61547691/how-to-plot-3-plots-simultaneously-in-one-plot)

***
End