# Exploratory Data Analysis (EDA)

<h3> Which characteristics of a car have the greatest impact on its price? </h3>

<h2 id="import_data">1. Importing preprocessed data</h2>

<h4>Setup</h4>

Importing libraries

In [None]:
import pandas as pd
import numpy as np

Load the data and store it in a dataframe:

In [None]:
path='https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/automobileEDA.csv'
df = pd.read_csv(path)
df.head()

<h2 id="pattern_visualization">2. Analyzing patterns of attributes individually using visualization</h2>

Import "Matplotlib" and "Seaborn" visualization packages.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

<h2> Continuous numeric variables:</h2> 

<p> Continuous numeric variables are variables that can hold any value within some range. Continuous numeric variables can have type "int64" or "float64". A great way to visualize these variables is to use scatterplots with regression lines. </p>

<p> To begin with, let's understand the (linear) relationship between an individual variable and price. We can do this using "regplot", which draws the scatterplot together with the regression line fitted to the data. </p>

We'll see different examples of linear relationships.

<h4> Positive linear relationship </h4>

Let's see the scatter plot of "engine-size" vs "price".

In [None]:
# Engine size as potential predictor variable of price
sns.regplot(x="engine-size", y="price", data=df)

<p> As the "engine-size" increases, the price increases: this indicates a positive correlation between these two variables. The "engine-size" looks like a good predictor of price, as the points lie fairly close to a clearly upward-sloping regression line.</p>
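
To make the idea of a fitted regression line concrete, here is a minimal sketch that fits a straight line by ordinary least squares with `numpy.polyfit`. It is a stand-in for the kind of line "regplot" draws, not seaborn's internal code, and the numbers are made-up toy values rather than rows from the automobile dataset.

```python
import numpy as np

# Made-up toy values (assumption), not taken from the automobile dataset
engine_size = np.array([90, 110, 130, 150, 180, 200], dtype=float)
price = np.array([7000, 9500, 12000, 15500, 21000, 24000], dtype=float)

# Degree-1 least-squares fit: price is approximated by slope * engine_size + intercept
slope, intercept = np.polyfit(engine_size, price, deg=1)

# Values on the fitted line for each observed engine size
predicted = slope * engine_size + intercept

print(f"slope = {slope:.1f}, intercept = {intercept:.1f}")
```

A positive slope corresponds to the upward-sloping line seen in the plot; how tightly the points cluster around the line is what indicates the strength of the relationship.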

Is "highway-mpg" a good predictor of the price?

In [None]:
sns.regplot(x="highway-mpg", y="price", data=df)

<p> As "highway-mpg" increases, the price decreases: this indicates an inverse (negative) relationship between these two variables. "highway-mpg" could therefore be a predictor of price.</p>

<h3> Weak linear relationship </h3>

Let's see if "Peak-rpm" is a predictor of the "price".

In [None]:
sns.regplot(x="peak-rpm", y="price", data=df)

<p> "peak-rpm" doesn't seem to be a good predictor of price, since the regression line is close to horizontal. In addition, the data points are widely scattered and far from the fitted line, showing a lot of variability. Therefore, it is not a reliable variable. </p>

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1>Exercise:</h1>

<p>Given the results of the correlation between "price" and "stroke", do you expect a linear relationship? Create the scatter plot.</p> 

</div>

# Visualizing relationships in pairs 

Seaborn's "pairplot" creates a matrix of axes and shows the relationship for each pair of columns in a DataFrame. By default, it also draws the univariate distribution of each variable on the diagonal.

In [None]:
#sns.pairplot(df, kind='reg', diag_kind='kde')
sns.pairplot(df, vars=["engine-size","city-L/100km","stroke",'peak-rpm',"price"], kind='reg', diag_kind='kde')

<h3>Categorical variables</h3>

<p>These variables describe discrete attributes and are selected from a small group of categories. Categorical variables can have type "object" or "int64". A good way to visualize categorical variables, relative to a numeric variable, is by using boxplots.</p>

![Image](https://cdn1.byjus.com/wp-content/uploads/2020/10/Box-Plot-and-Whisker-Plot-1.png)
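
The quantities a box plot displays can also be computed by hand. The following is a small sketch on made-up toy prices (an assumption, not dataset values), using the 1.5 × IQR rule for the whiskers, which is seaborn's default.

```python
import numpy as np

# Made-up toy prices (assumption), with one unusually high value
prices = np.array([5, 7, 8, 9, 10, 11, 12, 14, 30], dtype=float)

# Quartiles: the box spans q1..q3 and the line inside it is the median
q1, median, q3 = np.percentile(prices, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]

print(q1, median, q3, iqr, outliers)
```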

Let's see the relationship between "body-style" and "price".

In [None]:
sns.boxplot(x="body-style", y="price", data=df)

<p>We see that the price distributions of the different body-style categories overlap significantly, so body-style would not be a good predictor of price. Let's examine "engine-location" and "price":</p>

In [None]:
sns.boxplot(x="engine-location", y="price", data=df)

<p> Here we see that the price distributions of the two engine-location categories (front and rear) are different enough to consider the location of the engine a good predictor of the price.</p>

### Exercise:
Let's analyse "drive-wheels" and "price".

<h2 id="descriptive_statistics">3. Descriptive Statistical Analysis</h2>

<p> First, let's take a look at the variables using "describe".</p>

<p>The <b>describe</b> function automatically calculates basic statistics for all continuous variables. Any NaN values are automatically ignored in these statistics.</p>

It'll show:
<ul>
    <li> count</li>
    <li> mean</li>
    <li> standard deviation (std)</li>
    <li> minimum value</li>
    <li> maximum value</li>
    <li> quartiles (25%, 50% and 75%), from which the interquartile range (IQR) can be derived</li>
</ul>
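
As a quick illustration of how NaN values are skipped, here is a minimal sketch on a made-up toy DataFrame (the column names and values are assumptions, not the automobile data).

```python
import numpy as np
import pandas as pd

# Made-up toy data (assumption): one missing price and one text column
toy = pd.DataFrame({
    "price": [10.0, 20.0, np.nan, 30.0],
    "body": ["sedan", "wagon", "sedan", "sedan"],
})

# By default describe() summarizes the numeric columns only
summary = toy.describe()

# The missing price is left out of the count and of the mean
print(summary.loc["count", "price"], summary.loc["mean", "price"])
```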

In [None]:
df.describe()

<h3>Value count</h3>

<p> Counting values is a good way to understand how many rows fall into each category of a variable. We can apply the "value_counts" method to the 'drive-wheels' column. Here we call it on a Pandas series, so we use single square brackets, "df['drive-wheels']", which return a series, and not double square brackets, "df[['drive-wheels']]", which return a dataframe. Older Pandas versions offer "value_counts" only on series; recent versions also provide it on dataframes, where it counts unique rows.</p>
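
The difference between single and double square brackets can be checked directly. A minimal sketch on made-up toy values (an assumption, not the dataset):

```python
import pandas as pd

# Made-up toy column (assumption)
toy = pd.DataFrame({"drive-wheels": ["fwd", "rwd", "fwd", "4wd", "fwd"]})

single = toy["drive-wheels"]    # single brackets return a Series
double = toy[["drive-wheels"]]  # double brackets return a DataFrame

# Count how often each category occurs in the Series
counts = single.value_counts()

print(type(single).__name__, type(double).__name__)
print(counts)
```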

In [None]:
df['drive-wheels'].value_counts()

We can convert the series to a dataframe this way:

In [None]:
df['drive-wheels'].value_counts().to_frame()

Let's repeat the steps above, but save the results in a dataframe "drive_wheels_counts" whose single column is named 'value_counts'.

In [None]:
# to_frame(name=...) sets the column name directly; the default name differs between pandas versions
drive_wheels_counts = df['drive-wheels'].value_counts().to_frame(name='value_counts')
drive_wheels_counts

Now let's name the index 'drive-wheels':

In [None]:
drive_wheels_counts.index.name = 'drive-wheels'
drive_wheels_counts

### Exercise:

Repeat the process above for 'engine-location'.

<h2 id="basic_grouping">4. Grouping data </h2>

<p> The "groupby" method groups data by different categories. Data are grouped based on one or several variables and analysis is performed on the individual groups. </p>

<p> For example, let's group by the "drive-wheels" variable. We see that there are three different categories of drive wheels.</p>

In [None]:
df['drive-wheels'].unique()

<p> If we want to know which type of drive wheels is, on average, the most valuable, we can group by "drive-wheels" and then average the price within each group. </p>

<p>We select the columns 'drive-wheels' and 'price' and assign them to the variable "df_group_one".</p>

In [None]:
df_group_one = df[['drive-wheels','price']]

Then we can calculate the average price for each group. 

Note: when as_index=True (the default), the keys used in groupby() become the index of the new dataframe. Here we pass as_index=False so that 'drive-wheels' stays a regular column.
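
The effect of "as_index" is easiest to see side by side. A minimal sketch on made-up toy rows (an assumption, not the dataset):

```python
import pandas as pd

# Made-up toy rows (assumption)
toy = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "rwd", "rwd"],
    "price": [10.0, 20.0, 30.0, 50.0],
})

# Default (as_index=True): the group key becomes the index
as_index_true = toy.groupby("drive-wheels").mean()

# as_index=False: the group key stays a regular column
as_index_false = toy.groupby("drive-wheels", as_index=False).mean()

print(as_index_true)
print(as_index_false)
```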

In [None]:
df_group_one = df_group_one.groupby(['drive-wheels'],as_index=False).mean()

df_group_one

<p>According to our data, rear-wheel drive vehicles are, on average, the most expensive, while four-wheel drive and front-wheel drive vehicles are roughly the same price.</p>

<p>You can also group by multiple variables. For example, let's group by 'drive-wheels' and 'body-style'. This groups the dataframe by unique 'drive-wheels' and 'body-style' combinations. We can store the results in the 'grouped_test1' variable.</p>

In [None]:
# grouping results
df_gptest = df[['drive-wheels','body-style','price']]
grouped_test1 = df_gptest.groupby(['drive-wheels','body-style'],as_index=False).mean()
grouped_test1

<p>This grouped data is much easier to read when it's turned into a pivot table. A pivot table is like an Excel spreadsheet, with one variable along the columns and another along the rows. We can convert the grouped dataframe into a pivot table using the "pivot" method.</p>
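
As an aside, pandas also offers "pivot_table", which groups and pivots in a single call. This is an alternative to the groupby-then-pivot route used in this notebook, sketched here on made-up toy rows (an assumption, not the dataset):

```python
import pandas as pd

# Made-up toy rows (assumption)
toy = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "fwd", "rwd", "rwd"],
    "body-style": ["sedan", "sedan", "wagon", "sedan", "wagon"],
    "price": [10.0, 20.0, 12.0, 30.0, 28.0],
})

# Rows: drive-wheels, columns: body-style, cells: mean price of each combination
table = toy.pivot_table(
    values="price",
    index="drive-wheels",
    columns="body-style",
    aggfunc="mean",
)

print(table)
```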

<p>In this case, we'll put the drive-wheels variable as the table rows and "pivot" the body-style to become the table columns:</p>

In [None]:
grouped_pivot = grouped_test1.pivot(index='drive-wheels',columns='body-style')
grouped_pivot

<h1>Exercise:</h1>

<p>Use "groupby" to find the average price for each "body-style".</p>

<h4>Variables: Drive Wheels and Body Style vs Price</h4>

Let's use a heatmap to visualize how price relates to drive-wheels and body-style.

In [None]:
sns.heatmap(grouped_pivot, cmap="YlGnBu")

<p>The heatmap encodes the target variable (price) as color, with the 'drive-wheels' and 'body-style' variables on the vertical and horizontal axes, respectively. This allows us to visualize how price is related to 'drive-wheels' and 'body-style'.</p>

<p>Visualization is very important in data science, and Python's visualization packages give you great freedom.</p>

<p>The main question we want to answer in this module is "What are the main features that have the biggest impact on the price of the car?".</p>

<p>To get a better measure of the important features, we examine the correlation of these variables with the price of the car, i.e.: how does the price of the car depend on this variable?</p>

<h2>5. Correlation and causality</h2>

<p><b>Correlation</b>: a measure of the extent of interdependence between variables. </p>

<p><b>Causality</b>: the relationship of cause and effect between two variables.</p>

<p> It is important to know the difference between these two and that correlation does not imply causation. Determining correlation is much simpler than determining causation, as establishing a cause may require independent experimentation.</p>

<p><b>Pearson correlation</b></p>
<p> Pearson Correlation measures the linear dependence between two variables X and Y. </p>
<p> The resultant coefficient is a value between -1 and 1, in which: </p>
<ul>
    <li><b>1</b>: Perfect positive linear correlation </li>
    <li><b>0</b>: No linear correlation; the two variables show no linear relationship (a nonlinear one may still exist) </li>
    <li><b>-1</b>: Perfect negative correlation</li>
</ul>
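
To see where these values come from, here is a minimal pure-Python sketch of the Pearson coefficient, computed as the covariance divided by the product of the standard deviations. The helper name `pearson_r` and the toy lists are illustrative assumptions, not part of the notebook's data or of any library.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences (illustrative helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations (proportional to the covariance)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (proportional to the variances)
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
r_pos = pearson_r(x, [2, 4, 6, 8, 10])   # y rises exactly in step with x
r_neg = pearson_r(x, [10, 8, 6, 4, 2])   # y falls exactly in step with x
r_zero = pearson_r(x, [1, 2, 3, 2, 1])   # symmetric rise and fall: no linear trend

print(r_pos, r_neg, r_zero)
```

The third case is worth noting: the values clearly depend on x, yet the coefficient is zero because the dependence is not linear.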

<p>Pearson correlation is the default method of the "corr" function. As before, we can calculate the Pearson correlation between the variables of type 'int64' or 'float64'.</p>

In [None]:
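# Restrict to numeric columns; recent pandas versions raise an error on text columns
df.select_dtypes(include='number').corr()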

In [None]:
# Correlation matrix of the numeric columns only
P_corr = df.select_dtypes(include='number').corr()
plt.subplots(figsize=(10, 7))
sns.heatmap(P_corr)


Sometimes we would like to know the statistical significance of the correlation estimate.

<b>P-value</b>: 
<p>What is the P-value? The P-value is the probability of observing a correlation at least as strong as the one in our sample if the two variables were in fact uncorrelated. Typically, we choose a significance level of 0.05: when the P-value falls below it, we call the correlation statistically significant, accepting a 5% chance of doing so when no correlation actually exists.</p>

By convention, when the
<ul>
     <li>p-value is < 0.001: we say that there is strong evidence that the correlation is significant.</li>
     <li>p-value is < 0.05: there is moderate evidence that the correlation is significant.</li>
     <li>p-value is < 0.1: there is weak evidence that the correlation is significant.</li>
     <li>p-value is > 0.1: there is no evidence that the correlation is significant.</li>
</ul>
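
One way to build intuition for the P-value is a permutation test: shuffle one variable many times to break any real association, and count how often the shuffled data shows a correlation at least as strong as the observed one. This is a sketch of the idea on made-up synthetic data (an assumption), not the analytic calculation that scipy performs.

```python
import numpy as np

# Synthetic data (assumption): y depends linearly on x plus random noise
rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = 2.0 * x + rng.normal(scale=5.0, size=x.size)

observed_r = np.corrcoef(x, y)[0, 1]

# Shuffling y destroys any real link with x, simulating "no correlation"
n_perm = 2000
count = 0
for _ in range(n_perm):
    r = np.corrcoef(x, rng.permutation(y))[0, 1]
    if abs(r) >= abs(observed_r):
        count += 1

# Share of shuffles at least as extreme as the observation (+1 avoids reporting exactly 0)
p_perm = (count + 1) / (n_perm + 1)

print(f"r = {observed_r:.3f}, permutation p-value = {p_perm:.4f}")
```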

We can get this information using the module "stats" from the library "scipy".

In [None]:
from scipy import stats

<h3>Wheel-base vs Price</h3>

Let's calculate the Pearson correlation coefficient and the P-value of 'wheel-base' and 'price'.

In [None]:
pearson_coef, p_value = stats.pearsonr(df['wheel-base'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)  

<h5>Conclusion:</h5>
<p>As the p-value is < 0.001, the correlation between 'wheel-base' and 'price' is statistically significant, although the linear relationship is only moderate (~0.585).</p>

<h3>Horsepower vs Price</h3>

Let's calculate Pearson's correlation coefficient and the P-value of 'horsepower' and 'price'.

In [None]:
pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)  

<h5>Conclusion:</h5>
<p>As the p-value is < 0.001, the correlation between 'horsepower' and 'price' is statistically significant and the linear relationship is quite strong (~0.809, close to 1).</p>

<h3>Length vs Price</h3>

Let's calculate Pearson correlation coefficient and the P-value of 'length' and 'price'.

In [None]:
pearson_coef, p_value = stats.pearsonr(df['length'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)  

<h5>Conclusion:</h5>
<p>As the p value is < 0.001, the correlation between "length" and "price" is statistically significant and the linear relationship is moderately strong (~ 0.691).</p>

<h3>Width vs Price</h3>

Let's calculate Pearson correlation coefficient and the P-value of 'width' and 'price'.

In [None]:
pearson_coef, p_value = stats.pearsonr(df['width'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value ) 

<h5>Conclusion:</h5>
<p>As the p-value is < 0.001, the correlation between width and price is statistically significant and the linear relationship is moderately strong (~0.751). </p>

### Exercise: 
Analyze the correlation of the price with the variables 'curb-weight', 'engine-size', 'bore', 'city-mpg' and 'highway-mpg'. What is your conclusion?

### Curb-weight vs Price

<h3>Engine-size vs Price</h3>

<h3>Bore vs Price</h3>

<h3>City-mpg vs Price</h3>

<h3>Highway-mpg vs Price</h3>

<h2 id="anova">6. ANOVA</h2>

<h3>ANOVA: Analysis of variance</h3>
<p>The Analysis of Variance (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:</p>

<p><b>F-test score</b>: ANOVA starts from the assumption that all groups' means are equal and compares the variation between the group means with the variation within the groups; this ratio is reported as the F-test score. The higher the score, the greater the difference between the means relative to the spread within each group.</p>

<p><b>P-value</b>: P-value indicates how statistically significant our calculated score value is.</p>

<p>If our price variable is strongly correlated with the variable we are analyzing, expect the ANOVA to return a considerable F-test score and a small p-value.</p>
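
The F-test score can be worked out by hand for a small case. The sketch below computes the one-way ANOVA F statistic as the between-group mean square divided by the within-group mean square; the helper name `f_statistic` and the three toy groups are assumptions for illustration.

```python
def f_statistic(groups):
    """One-way ANOVA F statistic for a list of samples (illustrative helper)."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: how far each group mean is from the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-group sum of squares: spread of the values around their own group mean
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Made-up toy groups (assumption): the third group has a clearly higher mean
f_stat = f_statistic([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_stat)
```

Here the between-group mean square is 13 and the within-group mean square is 1, so the score is 13: the group means differ by much more than the values vary inside each group.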

![alt text](https://www.datanovia.com/en/wp-content/uploads/dn-tutorials/r-statistics-2-comparing-groups-means/images/one-way-anova-basics.png)

<h3>Drive Wheels</h3>

<p>As ANOVA analyzes the difference between different groups of the same variable, the groupby function will be useful. Since the ANOVA algorithm automatically averages the data, we don't need to calculate the average in advance.</p>

<p>To see how the different types of 'drive-wheels' impact the 'price', let's group the data.</p>

In [None]:
# Group by a single column name (not a one-element list) so that get_group accepts a plain key
grouped_test2 = df[['drive-wheels', 'price']].groupby('drive-wheels')
grouped_test2.head()

We can get the group values using "get_group".

In [None]:
grouped_test2.get_group('4wd')['price']

We can use the function 'f_oneway' from the module 'stats' to obtain the <b>F-test score</b> and <b>P-value</b>.

In [None]:
# ANOVA
f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'], grouped_test2.get_group('4wd')['price'])  
 
print( "ANOVA results: F=", f_val, ", P =", p_val)   

The large F-test score and the P-value close to 0 give strong evidence that the mean price differs between at least two of the drive-wheels groups. The test does not tell us which groups differ, though, so let's compare them in pairs.
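
The three-group call lists each group by name. A sketch of an equivalent pattern that collects the samples by iterating over the groups, which scales to any number of categories, is shown here on made-up toy rows (an assumption, not the dataset):

```python
import pandas as pd
from scipy import stats

# Made-up toy rows (assumption): three prices per drive-wheels category
toy = pd.DataFrame({
    "drive-wheels": ["fwd"] * 3 + ["rwd"] * 3 + ["4wd"] * 3,
    "price": [1.0, 2.0, 3.0, 5.0, 6.0, 7.0, 2.0, 3.0, 4.0],
})

# Iterating over a groupby yields (key, sub-dataframe) pairs
samples = [group["price"] for _, group in toy.groupby("drive-wheels")]

# f_oneway accepts any number of samples as positional arguments
f_val, p_val = stats.f_oneway(*samples)

print("ANOVA results: F=", f_val, ", P =", p_val)
```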

#### Analyzing separately: fwd and rwd

In [None]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'])  
 
print( "ANOVA results: F=", f_val, ", P =", p_val )

Let's examine the other groups.

#### 4wd and rwd

In [None]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'], grouped_test2.get_group('rwd')['price'])  
   
print( "ANOVA results: F=", f_val, ", P =", p_val)   

<h4>4wd and fwd</h4>

In [None]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'], grouped_test2.get_group('fwd')['price'])  
 
print("ANOVA results: F=", f_val, ", P =", p_val)   

### Exercise:
Perform variance analysis for variable 'horsepower-binned'.

In [None]:
# One possible solution to the exercise above: ANOVA on the binned horsepower groups
grouped_test_horse = df[['horsepower-binned', 'price']].groupby('horsepower-binned')
f_val, p_val = stats.f_oneway(grouped_test_horse.get_group('Low')['price'], grouped_test_horse.get_group('Medium')['price'], grouped_test_horse.get_group('High')['price'])

print("ANOVA results: F=", f_val, ", P =", p_val)

<h3>Conclusion: Important variables</h3>


<p>Now we have a better idea of what our important data looks like and what variables to consider when predicting the price of the car. We've summarized it to the following variables:</p>

Continuous numeric variables:
<ul>
    <li>Length</li>
    <li>Width</li>
    <li>Curb-weight</li>
    <li>Engine-size</li>
    <li>Horsepower</li>
    <li>City-mpg</li>
    <li>Highway-mpg</li>
    <li>Wheel-base</li>
    <li>Bore</li>
</ul>
    
Categorical variables:
<ul>
    <li>Drive-wheels</li>
</ul>

<p>As we move towards building machine learning models to automate our analysis, feeding the model with variables that significantly affect our target variable will improve our model's prediction performance.</p>

<h3>About the Authors:</h3>

This notebook was written by <a href="https://www.linkedin.com/in/mahdi-noorian-58219234/" target="_blank">Mahdi Noorian PhD</a>, <a href="https://www.linkedin.com/in/joseph-s-50398b136/" target="_blank">Joseph Santarcangelo</a>, Bahare Talayian, Eric Xiao, Steven Dong, Parizad, Hima Vsudevan, <a href="https://www.linkedin.com/in/fiorellawever/" target="_blank">Fiorella Wenver</a> and <a href="https://www.linkedin.com/in/yi-leng-yao-84451275/" target="_blank">Yi Yao</a>.

<p><a href="https://www.linkedin.com/in/joseph-s-50398b136/" target="_blank">Joseph Santarcangelo</a> is a Data Scientist at IBM, and holds a PhD in Electrical Engineering. His research focused on using Machine Learning, Signal Processing, and Computer Vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.</p>

Adapted by Carlos Carlim

<hr>
<p>Copyright &copy; 2018 IBM Developer Skills Network. This notebook and its source code are released under the terms of the <a href="https://cognitiveclass.ai/mit-license/">MIT License</a>.</p>