## Programming of Data Analysis Project 1

**Francesco Troja**

***

Project 1

>Create a data set by simulating a real-world phenomenon of your choosing. Then rather than collect data related to the phenomenon, you should model and synthesise such data using Python. We suggest you use the numpy.random package for this purpose. Specifically, in this project you should:
>- Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.
>- Investigate the types of variables involved, their likely distributions, and their relationships with each other.
>- Synthesise/simulate a data set as closely matching their properties as possible.
>- Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.


#### Installations

To execute this project, several Python libraries have been utilized. These libraries were chosen for their specific functionalities and capabilities, tailored to the requirements of the project:
1. `pandas`: The library's data structures, DataFrames and Series, organise and structure data efficiently, making operations such as filtering, grouping, and aggregating straightforward. Pandas also offers a wide range of functions for data cleaning and preparation, which makes it well suited to real-world data challenges[1].

In [1]:
import pandas as pd

#### Importing the Dataset

The provided dataset offers an extensive and detailed record of property house prices in Ireland, spanning the period from 2010 to 2023. It plays a significant role in capturing essential information about the real estate market in Ireland. This dataset provides valuable insights into the dynamics of property prices, trends, and fluctuations over a thirteen-year period. The housing market is a crucial element of a country's economy, and property prices reflect not only home values but also broader economic conditions, as well as the forces of supply and demand at play.

The dataset includes a wide range of variables, including the date of sale, property location, property type, and the sale price. These variables offer a rich source of information for analysis, allowing for the examination of various aspects of the property market, such as regional variations, property types, and the impact of economic events on house prices.

The dataset used was discovered on the [Kaggle](https://www.kaggle.com/datasets/raphaelmapp/ireland-house-prices-2010-to-2023/code) website.

In Python, CSV files are usually imported with the `read_csv()` function from the Pandas library. The function takes the file path, which specifies the location of the CSV file, reads the data from that file, and returns it as a Pandas DataFrame. The DataFrame holds the data in a structured, tabular format that makes exploration, manipulation, and analysis straightforward[2].
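
As a minimal, self-contained sketch of the same call, the snippet below reads a small CSV from an in-memory string instead of a file on disk. The two sample rows and the name `sample_df` are hypothetical and only mimic a few of the dataset's columns.

```python
import io

import pandas as pd

# Hypothetical two-row sample that mimics a few of the dataset's columns.
csv_text = (
    "Date,Address,County,Price\n"
    '28/11/2018,"1 Example Road, Dublin 4",Dublin,250000.0\n'
    '15/12/2016,"2 Sample Street, Cork",Cork,180000.0\n'
)

# read_csv accepts a file path or a file-like object; StringIO stands in
# for a file on disk here.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # number of rows and columns
print(sample_df.dtypes)  # data type inferred for each column
```

Quoted fields may contain commas, which is why the addresses are parsed as single values.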

In [2]:

df = pd.read_csv("Housing_Data_Jan2010_to_May2023_Cleaned.csv")

print(f'The dataset used is:\n {df}')

The dataset used is:
               Date                                            Address  \
0       28/11/2018             ABBEY GLEN, OFF POTTERY RD, CABINTEELY   
1       15/12/2016                   WALFORD, SHREWSBURY RD, DUBLIN 4   
2       29/03/2013                 Walford, Shrewsbury Road, Dublin 4   
3       19/05/2021                     Uimhi a Naoi, B�tha Sri�sbaire   
4       20/11/2019  Property at Castletown Demesne, Carrick-on-Sui...   
...            ...                                                ...   
597523         NaN                                                NaN   
597524         NaN                                                NaN   
597525         NaN                                                NaN   
597526         NaN                                                NaN   
597527         NaN                                                NaN   

          County       Price  Full_Market_Price  VAT_Exclusive  \
0         Dublin  14800000.0       

#### Data Exploration

Let's investigate the dataset's structure and characteristics. Statistical analysis is a method for uncovering patterns and correlations in data; the goal here is to provide a descriptive overview of the dataset and its variables. Let's have a look at the dataset's contents:

- The Pandas `head()` method is used to return the top n (default is 5) rows from a dataset.
- The Pandas `tail()` method is used to return the bottom n (default is 5) rows from a dataset[3].
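
Both methods accept the number of rows as an argument. A short sketch on a hypothetical seven-row DataFrame (the name `toy` and its values are illustrative only):

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the n argument.
toy = pd.DataFrame({"Price": [100, 200, 300, 400, 500, 600, 700]})

first_three = toy.head(3)  # top 3 rows
last_two = toy.tail(2)     # bottom 2 rows
default_head = toy.head()  # no argument: top 5 rows

print(first_three)
print(last_two)
```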


In [3]:
print("the first 5 rows of the dataset:\n")
df.head()


the first 5 rows of the dataset:



Unnamed: 0,Date,Address,County,Price,Full_Market_Price,VAT_Exclusive,Description_of_Property,Property_Size_Description
0,28/11/2018,"ABBEY GLEN, OFF POTTERY RD, CABINTEELY",Dublin,14800000.0,0.0,0.0,Second-Hand Dwelling house /Apartment,No Description
1,15/12/2016,"WALFORD, SHREWSBURY RD, DUBLIN 4",Dublin,14250000.0,0.0,0.0,Second-Hand Dwelling house /Apartment,No Description
2,29/03/2013,"Walford, Shrewsbury Road, Dublin 4",Dublin,14000000.0,0.0,0.0,Second-Hand Dwelling house /Apartment,No Description
3,19/05/2021,"Uimhi a Naoi, B�tha Sri�sbaire",Dublin,13250000.0,0.0,0.0,Second-Hand Dwelling house /Apartment,No Description
4,20/11/2019,"Property at Castletown Demesne, Carrick-on-Sui...",Kilkenny,12600000.0,0.0,0.0,Second-Hand Dwelling house /Apartment,No Description


In [4]:
print("the last 5 rows of the dataset:\n")
df.tail()

the last 5 rows of the dataset:



Unnamed: 0,Date,Address,County,Price,Full_Market_Price,VAT_Exclusive,Description_of_Property,Property_Size_Description
597523,,,,,,,,
597524,,,,,,,,
597525,,,,,,,,
597526,,,,,,,,
597527,,,,,,,,


As the output above shows, the index runs from 0 to 597527, so the selected dataset comprises 597,528 rows and 8 columns. This initial look also makes it apparent that the dataset contains missing values. The dimensions of the dataset can be confirmed with the Pandas `shape` attribute, which returns a tuple where the first element is the number of rows (observations) and the second element is the number of columns (variables) in the dataset[4].

In [5]:
print('The dimensions of the dataset are:\n')
df.shape

The dimensions of the dataset are:



(597528, 8)

To gain further insights into the DataFrame, the `info()` function can be used. This function provides metadata about the DataFrame, including the column names, the count of non-null values in each column, and the data type for each column[5]:

In [6]:
print('Find below the full summary of the Dataset:\n')
df.info()

Find below the full summary of the Dataset:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 597528 entries, 0 to 597527
Data columns (total 8 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   Date                       596655 non-null  object 
 1   Address                    596655 non-null  object 
 2   County                     596655 non-null  object 
 3   Price                      596655 non-null  float64
 4   Full_Market_Price          596644 non-null  float64
 5   VAT_Exclusive              596644 non-null  float64
 6   Description_of_Property    596644 non-null  object 
 7   Property_Size_Description  596655 non-null  object 
dtypes: float64(3), object(5)
memory usage: 36.5+ MB


The analysis of the dataset reveals the following key findings:

1. The dataset consists of 8 columns:

    - Date
    - Address
    - County
    - Price
    - Full_Market_Price
    - VAT_Exclusive
    - Description_of_Property
    - Property_Size_Description

2. It is apparent that there are missing values present in some of the columns.
3. The dataset is composed of a mix of data types: 5 columns have the object data type (text strings or mixed numeric and non-numeric values) and 3 columns are numerical (floating-point numbers)[6].
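
The split between text and numerical columns can also be obtained programmatically. The sketch below uses a hypothetical three-column DataFrame (`toy`) rather than the full dataset:

```python
import pandas as pd

# Hypothetical toy data with one text column and two float columns.
toy = pd.DataFrame({
    "County": ["Dublin", "Cork"],
    "Price": [250000.0, 180000.0],
    "VAT_Exclusive": [0.0, 1.0],
})

# select_dtypes filters columns by data type.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numerical columns:", numeric_cols)
print("Non-numerical columns:", text_cols)
```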

The summary highlights an issue: the "Date" column is stored as an object data type, which limits its usefulness for datetime operations. To resolve this, the column needs to be converted to the datetime64 data type. The Pandas `to_datetime()` function serves this purpose, transforming the column into a format that supports datetime operations on the dataset[7].


In [7]:
# Convert Date to a datetime type; the raw values are day-first (e.g. 28/11/2018)
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True, errors='coerce')
df["Date"].info()

<class 'pandas.core.series.Series'>
RangeIndex: 597528 entries, 0 to 597527
Series name: Date
Non-Null Count   Dtype         
--------------   -----         
596654 non-null  datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 4.6 MB
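
The effect of `errors='coerce'` can be seen on a small example. This is a sketch under the assumption that the dates are written day first, as raw values such as 28/11/2018 suggest; the series `raw_dates` and its invalid entry are hypothetical.

```python
import pandas as pd

# Hypothetical raw values: two day-first dates and one invalid entry.
raw_dates = pd.Series(["28/11/2018", "04/09/2018", "not a date"])

# An explicit format removes any ambiguity between day and month;
# errors="coerce" turns unparseable values into NaT instead of raising.
parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y", errors="coerce")

print(parsed)
print("Unparseable values:", parsed.isna().sum())
```

Values that cannot be parsed become `NaT`, which would explain why the non-null count of the column drops by one after the conversion.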




As the earlier `tail()` output showed, the dataset contains entire rows of null values. To identify these rows, the `isnull()` method can be used: it flags each missing or null value in the dataset. Knowing where the null values are is essential before proceeding with further analysis, and it allows for informed decisions about how to handle these rows[8].
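
A row with a single missing value is not the same as a completely empty row. The following sketch, on a hypothetical three-row DataFrame (`toy`), shows one way to count both cases from the `isnull()` mask:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one complete row, one row missing only its date,
# and one completely empty row.
toy = pd.DataFrame({
    "Date": ["28/11/2018", np.nan, np.nan],
    "County": ["Dublin", "Cork", np.nan],
    "Price": [250000.0, 180000.0, np.nan],
})

null_mask = toy.isnull()  # True wherever a value is missing

rows_with_any_null = int(null_mask.any(axis=1).sum())  # at least one gap
rows_all_null = int(null_mask.all(axis=1).sum())       # every value missing

print(rows_with_any_null, rows_all_null)
```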

In [8]:
# Use the Date variable as the reference for finding null rows
df[df['Date'].isnull()]

Unnamed: 0,Date,Address,County,Price,Full_Market_Price,VAT_Exclusive,Description_of_Property,Property_Size_Description
511,NaT,"56 EGLINTON RD, DONNYBROOK, DUBLIN 4",Dublin,3300000.0,0.0,0.0,Second-Hand Dwelling house /Apartment,No Description
596655,NaT,,,,,,,
596656,NaT,,,,,,,
596657,NaT,,,,,,,
596658,NaT,,,,,,,
...,...,...,...,...,...,...,...,...
597523,NaT,,,,,,,
597524,NaT,,,,,,,
597525,NaT,,,,,,,
597526,NaT,,,,,,,


The `info()` summary shows that 873 rows are completely empty (index 596655 to 597527), while row 511 is missing only its date, which `to_datetime()` could not parse. Because the empty rows carry no information, the next step removes them from the dataset. The `drop()` method is used for this purpose: it eliminates rows based on their index labels, which by default are 0-based integers assigned to each row. By specifying the row index, a row can be deleted from the dataset[9].

In [22]:
# Use inplace=True to modify the DataFrame directly
df.drop(df.index[596655:597528], inplace=True)
df.tail()

Unnamed: 0,Date,Address,County,Price,Full_Market_Price,VAT_Exclusive,Description_of_Property,Property_Size_Description
596650,2018-09-04,"65 ST JOSEPH'S PARK, NENAGH, CO TIPPERARY",Tipperary,5080.0,0.0,0.0,Second-Hand Dwelling house /Apartment,No Description
596651,2014-07-18,"CLOGHAN, GLENCOLMCILLE, DONEGAL",Donegal,5079.0,0.0,0.0,Second-Hand Dwelling house /Apartment,No Description
596652,2012-01-11,"Loghnabradden, Fintown, Co. Donegal",Donegal,5079.0,1.0,0.0,Second-Hand Dwelling house /Apartment,No Description
596653,2019-03-11,"COULAGHARD, EYERIES, BEARA",Cork,5030.53,0.0,0.0,Second-Hand Dwelling house /Apartment,No Description
596654,2023-01-10,"14 KNIGHTS PARK, CASTLEBAR, MAYO",Mayo,5001.0,1.0,0.0,Second-Hand Dwelling house /Apartment,No Description
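
The index-based removal above relies on knowing where the empty rows start. As an alternative sketch, not the approach used in this notebook, `dropna(how="all")` removes only the rows in which every value is missing, wherever they occur. The DataFrame `toy` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows 1 and 3 are completely empty,
# row 2 is missing only its price.
toy = pd.DataFrame({
    "County": ["Dublin", np.nan, "Mayo", np.nan],
    "Price": [250000.0, np.nan, np.nan, np.nan],
})

# how="all" drops a row only when all of its values are missing;
# the result is a new DataFrame and toy itself is left unchanged.
cleaned = toy.dropna(how="all")

print(cleaned)
```

Partially filled rows, such as the one missing only its price, are kept for later handling.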


The analysis now shifts to the missing values that remain within the variables of the dataset. To determine how many missing values exist for each variable, the `sum()` method can be chained onto `isnull()`[10].

In [23]:
print('The missing values are:\n')
df.isnull().sum()

The missing values are:



Date                          1
Address                       0
County                        0
Price                         0
Full_Market_Price            11
VAT_Exclusive                11
Description_of_Property      11
Property_Size_Description     0
dtype: int64

The result of this analysis reveals that 4 of the 8 variables in the dataset contain missing values. Specifically, the "Date" variable has 1 missing value, while the "Full_Market_Price", "VAT_Exclusive", and "Description_of_Property" variables each have 11 missing values. The approach to handling missing values depends on the specific analysis or task at hand, so identifying and understanding the nature of the missing data is a critical part of data preparation. Depending on the goals of the analysis, the options include imputation (filling in missing values with estimated or calculated data) and data cleaning techniques that ensure the dataset's quality and suitability for the intended analysis[11].
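
The two broad options can be sketched on a hypothetical three-row DataFrame (`toy`). The choice of the most frequent value and of the placeholder text "Unknown" are assumptions made for illustration, not decisions taken in this project.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two columns.
toy = pd.DataFrame({
    "Price": [250000.0, 180000.0, 320000.0],
    "VAT_Exclusive": [0.0, np.nan, 1.0],
    "Description_of_Property": ["New Dwelling", None, "Second-Hand Dwelling"],
})

# Option 1: remove the rows that have gaps in the selected columns.
dropped = toy.dropna(subset=["VAT_Exclusive", "Description_of_Property"])

# Option 2: impute. Fill the numeric flag with its most frequent value
# and the text column with a placeholder label.
filled = toy.copy()
most_common_flag = filled["VAT_Exclusive"].mode()[0]
filled["VAT_Exclusive"] = filled["VAT_Exclusive"].fillna(most_common_flag)
filled["Description_of_Property"] = filled["Description_of_Property"].fillna("Unknown")

print(dropped)
print(filled)
```

With only 11 incomplete rows out of nearly 600,000, dropping them would change the dataset very little, whereas imputation keeps every recorded sale price.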

#### Statistical information

### References

[1]: Chugh V., (2023). "*Python pandas tutorial: The ultimate guide for beginners*". [Datacamp](https://www.datacamp.com/tutorial/pandas)

[2]: Analyseup, (n.d.). "*Importing Data with Pandas*". [Analyseup](https://www.analyseup.com/learn-python-for-data-science/python-pandas-importing-data.html)

[3]: Shazra H., (2023). "*head () and tail () Functions Explained with Examples and Codes*". [Analytics Vidhya](https://www.analyticsvidhya.com/blog/2023/07/head-and-tail-functions/)

[4]: Pandas, (n.d.). "*pandas.DataFrame.shape*". [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html)

[5]: Rajan S., (2023). "*Python | Pandas dataframe.info()*". [GeeksforGeeks](https://www.geeksforgeeks.org/python-pandas-dataframe-info/)

[6]: Moffitt C., (2018). "*Overview of Pandas Data Types*". [Practical Business Python](https://pbpython.com/pandas_dtypes.html#:~:text=An%20object%20is%20a%20string,df)

[7]: Stack Overflow, (2014). "*Convert Pandas Column to DateTime*". [Stack Overflow](https://stackoverflow.com/questions/26763344/convert-pandas-column-to-datetime)

[8]: Data to Fish, (2021). "*Select all Rows with NaN Values in Pandas DataFrame*". [Data to Fish](https://datatofish.com/rows-with-nan-pandas-dataframe/)

[9]: Lynn S., (n.d.). "*Delete Rows & Columns in DataFrames Quickly using Pandas Drop*". [Shane Lynn](https://www.shanelynn.ie/pandas-drop-delete-dataframe-rows-columns/)

[10]: Welck Aj, (n.d.). "*How to Check If Any Value is NaN in a Pandas DataFrame*". [Chartio](https://chartio.com/resources/tutorials/how-to-check-if-any-value-is-nan-in-a-pandas-dataframe/)

[11]: Shashank S., (2023). "*Defining, Analysing, and Implementing Imputation Techniques*". [Analytics Vidhya](https://www.analyticsvidhya.com/blog/2021/06/defining-analysing-and-implementing-imputation-techniques/)

### Additional readings

- Stack Overflow, (2020). "*Python how to fix year out of range error*". [Stack Overflow](https://stackoverflow.com/questions/62130640/python-how-to-fix-year-out-of-range-error)
- Stack Overflow, (2017). "*Understanding inplace=True in pandas*". [Stack Overflow](https://stackoverflow.com/questions/43893457/understanding-inplace-true-in-pandas)

***
End