<b><h1 style="font-size:30px;text-align: center;">- Data Engineering - Exercise 7 -</h1></b>
<b><h1 style="font-size:25px;text-align: center;">- Exploratory Data Analysis in Python -</h1></b>

<center><img src="https://www.algebra.hr/sveuciliste/wp-content/uploads/2023/11/algebra_UNIVERSITY-1-800x242.png"/></center>

=======================================================================================================


<b>*Made: November 2024.* </b>

<b>*Author: Mislav Spajić, mag. ing. comp.*</b>

# Exploratory Data Analysis in Python

## Introduction

**What is Exploratory Data Analysis ?**

Exploratory Data Analysis (EDA) is the process of understanding a data set by summarizing its main characteristics, often by plotting them. This step is especially important before we model the data for machine learning. Common EDA plots include histograms, box plots, scatter plots, and many more. Exploring the data often takes a lot of time. EDA also helps us define or refine the problem statement for our data set, which is very important.

**How to perform Exploratory Data Analysis ?**

This is a question that everyone is keen to have answered. The answer is that it depends on the data set you are working with. There is no single method for performing EDA; this tutorial covers some common methods and plots used in the EDA process.

**What data are we exploring today ?**



Since I am a huge fan of cars, I picked a very nice data set of cars from Kaggle. The data set can be downloaded from [here](https://www.kaggle.com/CooperUnion/cardataset). In brief, it contains more than 10,000 rows and more than 10 columns describing features of each car, such as Engine Fuel Type, Engine HP, Transmission Type, highway MPG, city MPG, and many more. In this tutorial, we will explore the data and make it ready for modeling.



---



## 1. Importing the required libraries for EDA

Below are the libraries used to perform EDA in this part.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns                       #visualisation
import matplotlib.pyplot as plt             #visualisation



---



## 2. Loading the data into the data frame.

Loading the data into a pandas data frame is one of the most important steps in EDA. The values in the data set are comma-separated, so all we have to do is read the CSV file into a data frame, and pandas does the job for us.

In [None]:
df = pd.read_csv("cars.csv")
# To display the top 5 rows 
df.head(5)               

In [None]:
df.tail(5) # To display the bottom 5 rows

**Questions to ask during the discovery phase:**
1. How can I break this data into smaller groups so that I can understand it better?
2. How can I prove my hypothesis?
3. In its current form, can this data give me the answers I need?

**Functions for data discovery:**

| Function | Description |
| ---- | ---- |
| `DataFrame.head()` | The head() method will display the first n rows of the dataframe. <br> In the argument field, input the number of rows you want displayed in a Python notebook. The default is 5 rows. |
| `DataFrame.info()` | The info() method will display a summary of the dataframe, including the range index, dtypes, column headers, and memory usage.<br> Leaving the argument field blank will return a full summary. As an option, in the argument field you can type in show_counts=True, which will return the count of non-null values for each column. |
| `DataFrame.describe()` | The describe() method will return descriptive statistics of the entire dataset, including total count, mean, minimum, maximum, dispersion, and distribution. <br> Leaving the argument field blank will default to returning a summary of the data frame’s statistics. As an option, you can use “include=[X]” and “exclude=[X]” which will limit the results to specific data types, depending on what you input in the brackets. | 
| `DataFrame.shape` | shape is an attribute that returns a tuple representing the dimensions of the dataframe by number of rows and columns. Remember that attributes are not followed by parentheses. |
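
The optional arguments mentioned in the table can be tried on a small frame. The sketch below uses a hypothetical `toy` DataFrame (made-up values, not the cars data) and is only meant to show roughly how `show_counts`, `include`, and `exclude` change the output.

```python
import pandas as pd

# Hypothetical toy data for illustration only (not the cars dataset)
toy = pd.DataFrame({
    "Make": ["BMW", "Audi", None, "Fiat"],
    "HP": [300.0, 220.0, 150.0, None],
    "Price": [46135, 40650, 36350, 29450],
})

# Summary with the non-null count of every column
toy.info(show_counts=True)

# Descriptive statistics limited to numeric columns (HP and Price)
numeric_stats = toy.describe(include=["number"])

# Descriptive statistics for everything that is not numeric (Make)
text_stats = toy.describe(exclude=["number"])

print(numeric_stats)
print(text_stats)

# shape is an attribute, so it takes no parentheses
print(toy.shape)
```

For the non-numeric column, `describe()` reports count, unique, top, and freq instead of mean and quartiles.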



---



## 3. Basic exploration

Here we check the data types, because sometimes the MSRP (the price of the car) is stored as a string. In that case, we have to convert the string to a numeric type before we can plot the data. In this case, the data is already in integer format, so there is nothing to worry about.
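
If a price column did arrive as text, one possible conversion looks like the sketch below. The `prices` Series and its string formats are assumptions made up for illustration; the cars data set itself does not need this step.

```python
import pandas as pd

# Hypothetical price strings; the real MSRP column is already numeric
prices = pd.Series(["$46,135", "40650", "n/a", "29,450"], name="MSRP")

# Strip currency symbols and thousands separators
cleaned = prices.str.replace(r"[$,]", "", regex=True)

# Convert to numbers; values that cannot be parsed become NaN
numeric_prices = pd.to_numeric(cleaned, errors="coerce")

print(numeric_prices)
print(numeric_prices.dtype)
```

Using `errors="coerce"` keeps the conversion from failing on odd entries, at the cost of introducing missing values that have to be handled later.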

In [None]:
df.dtypes

In [None]:
# Gather basic information about the dataset
df.info()

In [None]:
# Gather descriptive statistics about the data
df.describe()

In [None]:
# Display the size of the dataframe
df.shape



---



## 4. Out-of-the-box, more advanced EDA with ydata-profiling

In [None]:
from ydata_profiling import ProfileReport

In [None]:
profile = ProfileReport(df, title="Cars Dataset Profiling Report")

In [None]:
profile

## 5. Dropping irrelevant columns

This step is needed in almost every EDA, because a data set often has columns that we never use, and dropping them is the simplest solution. In this case, columns such as Engine Fuel Type, Market Category, Vehicle Style, Popularity, Number of Doors, and Vehicle Size are not relevant to this analysis, so I dropped them.

In [None]:
df = df.drop(['Engine Fuel Type', 'Market Category', 'Vehicle Style', 'Popularity', 'Number of Doors', 'Vehicle Size'], axis=1)
df.head(5)



---



## 6. Renaming the columns

In this instance, most of the column names are hard to read, so I shortened them. This is a good practice because it improves the readability of the data set.

In [None]:
df = df.rename(columns={"Engine HP": "HP", "Engine Cylinders": "Cylinders", "Transmission Type": "Transmission", "Driven_Wheels": "Drive Mode","highway MPG": "MPG-H", "city mpg": "MPG-C", "MSRP": "Price" })
df.head(5)
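
When there are too many columns to rename by hand, the headers can also be normalized in one pass. This is an alternative sketch, not the approach used in this exercise; the `toy` frame is hypothetical and only copies the mixed naming style of a few raw headers.

```python
import pandas as pd

# Hypothetical empty frame whose headers mimic the raw naming style
toy = pd.DataFrame(columns=["Engine HP", "Driven_Wheels", "highway MPG", "city mpg"])

# Trim whitespace, lower-case every header, and replace spaces with underscores
toy.columns = (
    toy.columns.str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
)

print(list(toy.columns))
```

The explicit dictionary passed to `rename()` gives shorter, hand-picked names, while this variant only makes the existing names consistent.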



---



## 7. Duplicated Data


**Identifying duplicates** 
A simple way to identify duplicates is to use the `DataFrame.duplicated()` method from pandas. This method returns a Series of `True`/`False` values, with `True` indicating that the row repeats an earlier row, and `False` indicating that it is unique or the first occurrence.

**Keeping or Dropping Duplicates** Every dataset is unique and you cannot treat every dataset the same. When making the decision on whether to eliminate duplicate values or not, think deeply about the dataset itself and about the objective you wish to achieve. What impact will dropping duplicates have on your dataset and your objective? 
1. **Deciding to drop |** You should drop or eliminate duplicate values if duplicate values are clearly mistakes or will misrepresent the remaining unique values in the dataset.  
2. **Deciding to NOT drop |** You should keep duplicated data in your dataset if the duplicate values are clearly not mistakes and should be taken into account when representing the dataset as a whole. 
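
Before deciding, it helps to look at what `duplicated()` actually flags. The following sketch uses a hypothetical four-row `toy` frame (made-up rows) to show the default behaviour and the `keep=False` option.

```python
import pandas as pd

# Hypothetical rows: rows 0, 1 and 3 are identical
toy = pd.DataFrame({
    "Make": ["BMW", "BMW", "Audi", "BMW"],
    "Model": ["1 Series", "1 Series", "A4", "1 Series"],
    "Year": [2011, 2011, 2015, 2011],
})

# Default keep="first": the first occurrence is NOT flagged
first_kept = toy.duplicated()

# keep=False: every member of a duplicate group is flagged
all_marked = toy.duplicated(keep=False)

print(first_kept.tolist())   # [False, True, False, True]
print(all_marked.tolist())   # [True, True, False, True]

# drop_duplicates() keeps one copy of each distinct row
deduplicated = toy.drop_duplicates()
print(deduplicated.shape)    # (2, 3)
```

Inspecting `toy[all_marked]` is a quick way to review whole duplicate groups before choosing to drop or keep them.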

Dropping duplicates is often useful, because a large data set like this one, with more than 10,000 rows, often contains some duplicate data that can distort the analysis. Here I remove all the duplicate rows from the data set. Before removal I had 11914 rows; after removing the duplicates I had 10925, meaning that 989 rows were duplicates.

In [None]:
df.shape

In [None]:
duplicate_rows_df = df[df.duplicated()]
print("number of duplicate rows: ", duplicate_rows_df.shape)

Now let us remove the duplicate rows, since in this case it is safe to do so.

In [None]:
df.count()      # Counts the non-null values in each column

As seen above, there are 11914 rows, and we are removing 989 duplicate rows.

In [None]:
df = df.drop_duplicates()
df.head(5)

In [None]:
df.count()



---



## 8. Dropping the missing or null values.

This step is similar to the previous one, but here the missing values are detected and then dropped. Dropping is not always the best approach; many people instead replace missing values with the mean of that column. In this case I simply dropped them, because there are only about 100 missing values out of roughly 10,000 rows, which is negligible.
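
For comparison, the mean-replacement alternative mentioned above could look like the following sketch. It runs on a hypothetical `toy` frame with made-up values and is not applied to the cars data in this exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column
toy = pd.DataFrame({
    "HP": [300.0, np.nan, 150.0, 210.0],
    "Cylinders": [6.0, 4.0, np.nan, 4.0],
})

# Fraction of missing values per column, useful for deciding drop vs. impute
missing_share = toy.isnull().mean()

# Replace each missing value with the mean of its own column
imputed = toy.fillna(toy.mean(numeric_only=True))

print(missing_share)
print(imputed)
```

Mean imputation keeps every row but pulls the filled values toward the center of the distribution, so it is worth checking `missing_share` first to see how much of a column would be affected.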

In [None]:
print(df.isnull().sum())

This explains why, in the counts from the previous step, Cylinders and Horsepower (HP) had 10856 and 10895 values respectively, out of 10925 rows.

In [None]:
df = df.dropna()    # Dropping the missing values.
df.count()

Now we have removed all the rows that contain null or N/A values (in Cylinders and Horsepower (HP)).

In [None]:
print(df.isnull().sum())   # After dropping the values



---



## 9. Detecting Outliers

**Outliers |** Observations that are an abnormal distance from other values or an overall pattern in a data population

**3 Types of Outliers**
- Global outliers 
- Contextual outliers
- Collective outliers

**Global Outliers |** Values that are completely different from the overall data group and have no association with any other outliers

**Contextual outliers |** Data points that are normal under certain conditions but become anomalies under most other conditions 

**Collective outliers |** A group of abnormal points that follow similar patterns and are isolated from the rest of the population 

**How to handle outliers** 

It is important to not only detect outliers, but also to have a plan for them.

Whether you keep outliers as they are, delete them, or reassign values is a decision that you make on a dataset-by-dataset basis. To help you make the decision, you can start with these general guidelines:

- **Delete them**: If you are sure the outliers are mistakes, typos, or errors and the dataset will be used for modeling or machine learning, then you are more likely to decide to delete outliers. Of the three choices, you’ll use this one the least.
- **Reassign them**: If the dataset is small and/or the data will be used for modeling or machine learning, you are more likely to choose a path of deriving new values to replace the outlier values.
- **Leave them**: For a dataset that you plan to do EDA/analysis on and nothing else, or for a dataset you are preparing for a model that is resistant to outliers, it is most likely that you are going to leave them in.
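
As one possible way to "reassign" outliers, values outside the 1.5x IQR fences can be capped at the fences instead of being deleted. The sketch below assumes a hypothetical `prices` Series with one extreme value; capping is only one of several reassignment strategies.

```python
import pandas as pd

# Hypothetical prices with one extreme value at the end
prices = pd.Series([20000, 22000, 25000, 27000, 30000, 250000], name="Price")

# Interquartile range and the usual 1.5 * IQR fences
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap values outside the fences instead of dropping the rows
capped = prices.clip(lower=lower, upper=upper)

print(lower, upper)
print(capped.tolist())
```

Only the extreme value changes; every row is kept, which matters when the data set is small.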

**Useful Functions:**

| Function | Description |
| ---- | ---- |
| `df.describe()` | A DataFrame method that returns general statistics about the dataframe which can help determine outliers |
| `sns.boxplot()` | A seaborn function that generates a box plot. Data points beyond 1.5x the interquartile range are considered outliers. |

In [None]:
sns.boxplot(x=df['Price'])
plt.show()

In [None]:
sns.boxplot(x=df['HP'])
plt.show()

In [None]:
sns.boxplot(x=df['Cylinders'])
plt.show()

In [None]:
# Select only numerical columns from the DataFrame
numerical_df = df.select_dtypes(include=['number'])

In [None]:
Q1 = numerical_df.quantile(0.25)
Q3 = numerical_df.quantile(0.75)
IQR = Q3 - Q1
print(IQR)

Don't worry about the exact values above. What matters is knowing how to use this technique to remove the outliers.

In [None]:
# Filter the DataFrame to remove outliers
df_filtered = numerical_df[~((numerical_df < (Q1 - 1.5 * IQR)) | (numerical_df > (Q3 + 1.5 * IQR))).any(axis=1)]

# Check the new shape
df_filtered.shape

In [None]:
# Filter the original DataFrame using the index of filtered rows
df = df.loc[df_filtered.index]
df.shape

As seen above, around 1600 rows were outliers. This technique may not remove every outlier; one or two may remain, but that is acceptable given how many were removed. Something is better than nothing.
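
To see which columns contribute the outliers, the same IQR rule can be counted per column before filtering. This is a sketch on a hypothetical `toy` frame with made-up values, not on the cars data.

```python
import pandas as pd

# Hypothetical data: one extreme HP value, no extreme Cylinders value
toy = pd.DataFrame({
    "HP": [150, 160, 170, 180, 190, 900],
    "Cylinders": [4, 4, 4, 6, 6, 6],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# Boolean frame: True where a value lies outside the 1.5 * IQR fences
is_outlier = (toy < q1 - 1.5 * iqr) | (toy > q3 + 1.5 * iqr)

# Number of flagged values in each column
print(is_outlier.sum())

# Keep only the rows with no flagged value in any column
kept = toy[~is_outlier.any(axis=1)]
print(kept.shape)
```

Because a row is removed as soon as any single column is flagged, a per-column count like this helps explain why so many rows disappear.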



---



## 10. Plot different features against one another (scatter), against frequency (histogram)

### Histogram

A histogram shows how often the values of a variable occur within each interval. This data set contains many different car manufacturers, and it is often useful to know which of them has the most cars. A bar chart of counts per make, the categorical counterpart of a histogram, is a simple way to show the total number of cars from each manufacturer.

In [None]:
df.Make.value_counts().nlargest(40).plot(kind='bar', figsize=(10,5))
plt.title("Number of cars by make")
plt.ylabel('Number of cars')
plt.xlabel('Make');
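
The chart above counts categories. For a numeric column, a histogram in the strict sense groups values into intervals (bins). The sketch below uses synthetic horsepower values generated with NumPy, which are an assumption for illustration and not taken from the cars data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic horsepower values, generated for illustration only
rng = np.random.default_rng(0)
hp = rng.normal(loc=250, scale=60, size=500)

fig, ax = plt.subplots(figsize=(10, 5))

# ax.hist returns the count per bin and the bin edges
counts, bin_edges, _ = ax.hist(hp, bins=20)

ax.set_xlabel("HP")
ax.set_ylabel("Number of cars")
ax.set_title("Distribution of HP (synthetic data)")
plt.show()
```

With the real data, the same call would presumably be `ax.hist(df['HP'], bins=20)`; the number of bins is a free choice that changes how smooth the picture looks.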

### Heat Maps

A heat map is a useful plot when we need to find dependent variables. It is one of the best ways to see the relationships between features. In the heat map below, we can see that the price depends mainly on Engine Size, Horsepower, and Cylinders.

In [None]:
plt.figure(figsize=(10,5))
c = df.select_dtypes(include=['number']).corr()  # correlations on the cleaned data (after outlier removal)
sns.heatmap(c,cmap="BrBG",annot=True)
c
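
Reading a single column of the correlation matrix is often easier than scanning the whole heat map. The sketch below ranks features by the strength of their correlation with the price; the `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical values for illustration only
toy = pd.DataFrame({
    "HP": [100, 150, 200, 250, 300],
    "Cylinders": [4, 4, 6, 6, 8],
    "MPG-H": [40, 35, 30, 26, 20],
    "Price": [15000, 21000, 30000, 38000, 52000],
})

# Correlation of every feature with Price, strongest first (by absolute value)
price_corr = (
    toy.corr()["Price"]
    .drop("Price")
    .sort_values(key=abs, ascending=False)
)

print(price_corr)
```

Sorting by absolute value keeps strong negative relationships, such as fuel economy against price in this toy data, near the top of the list.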

### Scatterplot

We generally use scatter plots to examine the correlation between two variables. Here the scatter plot shows Horsepower against Price. Because the points are well spread, we can easily draw a trend line through them.

In [None]:
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(df['HP'], df['Price'])
ax.set_xlabel('HP')
ax.set_ylabel('Price')
plt.show()

In [None]:
# Ensure the data does not contain NaN values
df = df.dropna(subset=['HP', 'Price'])

# Extract the relevant columns
x = df['HP']
y = df['Price']

# Calculate the coefficients for the trend line (linear regression)
coefficients = np.polyfit(x, y, 1)  # Degree 1 for a linear trend
trend_line = np.poly1d(coefficients)

# Generate the y-values for the trend line
trend_y = trend_line(x)

# Plot the scatter plot with the trend line
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(x, y, alpha=0.7, label='Data Points')
ax.plot(x, trend_y, color='red', label='Trend Line', linewidth=2)
ax.set_xlabel('HP')
ax.set_ylabel('Price')
ax.set_title('Scatter Plot of HP vs Price with Trend Line')
ax.legend()
plt.show()
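
The numbers behind a trend line can be inspected directly. This sketch uses hypothetical points that lie exactly on a straight line, so the fitted slope and intercept are known in advance; real data will of course be noisier.

```python
import numpy as np

# Hypothetical points lying exactly on Price = 100 * HP + 5000
hp = np.array([100.0, 150.0, 200.0, 250.0, 300.0])
price = 100.0 * hp + 5000.0

# np.polyfit returns coefficients from the highest degree down: [slope, intercept]
slope, intercept = np.polyfit(hp, price, 1)

# Pearson correlation coefficient between the two variables
r = np.corrcoef(hp, price)[0, 1]

# Use the fitted line to estimate the price at 220 HP
predicted = np.poly1d([slope, intercept])(220.0)

print(slope, intercept, r, predicted)
```

The slope reads as the estimated price change per additional unit of horsepower, and `r` summarizes how closely the points follow the line.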

**These are some of the general steps involved in exploratory data analysis. There is much more to EDA, but for now this gives a good idea of how to perform it on any data set. Stay tuned for more updates.**