# Exploratory Data Analysis Exercise
* For this part we will be using the `data/cars.csv` dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

%matplotlib inline 
import scipy.stats as stats

df = pd.read_csv('data/cars.csv')
df.head()

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
0,BMW,1 Series M,2011,premium unleaded (required),335.0,6.0,MANUAL,rear wheel drive,2.0,Compact,Coupe,26,19,3916,46135
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,Compact,Convertible,28,19,3916,40650
2,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,Compact,Coupe,28,20,3916,36350
3,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Compact,Coupe,28,18,3916,29450
4,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Compact,Convertible,28,18,3916,34500


# Load in the data
* Use the file in the data folder called 'cars.csv'
* Save it as a variable named 'df'
* Display the first 5 rows of our dataframe

In [5]:
# Load data

```py
# Load data


df = pd.read_csv('data/cars.csv')
df.head()

```

# Data clean up part 1.

1. Print the number of duplicate rows we have in our dataframe.

2. Modify our df to have all duplicate rows removed. 

3. Do a sanity check to make sure all duplicates have been removed by printing the total number of duplicate rows again.
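
The three steps above can be sketched on a tiny, made-up dataframe. The `toy` frame and its values are assumptions for illustration, not rows from `cars.csv`.

```py
import pandas as pd

# Hypothetical toy data (not from cars.csv): rows 0 and 2 are identical.
toy = pd.DataFrame({
    "Make": ["BMW", "Audi", "BMW"],
    "Year": [2011, 2012, 2011],
})

# duplicated() flags every row that repeats an earlier row.
n_before = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row.
deduped = toy.drop_duplicates()

# Sanity check: no duplicates should remain.
n_after = int(deduped.duplicated().sum())

print(n_before, n_after, deduped.shape)
```

Note that `duplicated()` counts only the repeats, so two identical rows count as one duplicate.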

In [2]:
# 1. Print the number of duplicate rows we have in our dataframe.



In [3]:
#  2. Modify our df to have all duplicate rows removed. 

print(df.shape)

(11914, 15)


In [6]:
# 3. Do a sanity check to make sure all duplicates have been removed by printing the total number of duplicate rows again.



```py
# 1. Print the number of duplicate rows we have in our dataframe.
print(df.duplicated().sum())

# 2. Modify our df to have all duplicate rows removed.
print(df.shape)  # shape before removing duplicates
df = df.drop_duplicates()
print(df.shape)  # shape after removing duplicates

# 3. Do a sanity check to make sure all duplicates have been removed by printing the total number of duplicate rows again.
print(df.duplicated().sum())
```

# Data clean up part 2.
* Which column has the most null values and how many null values does it have?
* Print how long our dataframe is.
* Remove any row that has a null value in it. 
* Do a sanity check and print how long our dataframe is now that we have removed our null values.
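
One possible way to answer the first question directly is to take `idxmax()` of the per-column null counts. This is a minimal sketch on made-up data; the `toy` frame and its values are assumptions for illustration.

```py
import numpy as np
import pandas as pd

# Hypothetical toy data (not from cars.csv) with a few missing values.
toy = pd.DataFrame({
    "Engine HP": [300.0, np.nan, 230.0, np.nan],
    "Number of Doors": [2.0, 2.0, np.nan, 4.0],
    "MSRP": [46135, 40650, 36350, 29450],
})

# Count nulls per column, then pick the column with the largest count.
null_counts = toy.isnull().sum()
worst_column = null_counts.idxmax()
worst_count = int(null_counts.max())

# Compare the number of rows before and after dropping rows with any null.
rows_before = len(toy)
cleaned = toy.dropna()
rows_after = len(cleaned)

print(worst_column, worst_count, rows_before, rows_after)
```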

In [7]:
# * Which column has the most null values and how many null values does it have?



In [8]:
# * Print how long our dataframe is.



In [9]:
# * Remove any row that has a null value in it. 



In [10]:
# * Do a sanity check and print how long our dataframe is now that we have removed our null values.



```py

# * Which column has the most null values and how many null values does it have?
null_counts = df.isnull().sum().sort_values(ascending=False)
print(null_counts.head(1))

# * Print how long our dataframe is.
print(len(df))

# * Remove any row that has a null value in it.
df = df.dropna()

# * Do a sanity check and print how long our dataframe is now that we have removed our null values.
print(len(df))


```



### Make a bar chart that displays how many times each brand of car appears in this data. 
_Brand of car is the `Make` of the car._
* You can achieve this by using value_counts or by a groupby.  Either is fine with me. 
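
To see that the two approaches give the same counts, here is a small sketch on made-up data. The `toy` frame is an assumption for illustration.

```py
import pandas as pd

# Hypothetical toy data (not from cars.csv).
toy = pd.DataFrame({"Make": ["BMW", "Audi", "BMW", "Ford", "BMW", "Audi"]})

# Option 1: value_counts returns counts sorted from most to least frequent.
by_value_counts = toy["Make"].value_counts()

# Option 2: groupby + count returns counts sorted by the group key (the Make).
by_groupby = toy.groupby("Make")["Make"].count()

# The ordering differs, but the counts per Make are the same.
same_counts = by_value_counts.to_dict() == by_groupby.to_dict()
print(same_counts)
```

Either result can be passed to `.plot(kind='bar')`; only the default bar order differs.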

In [11]:
# Make a bar chart that displays how many times each brand of car appears in this data. 




```py

# Count how many times each Make appears and draw the counts as a bar chart.
fig, ax = plt.subplots(nrows=1, ncols=1)

gb = df.groupby('Make')
count_makes = gb['Make'].count()
count_makes.plot(kind='bar', ax=ax)

# Save the chart to disk, then close the figure.
fig.savefig("./path.png")
plt.close(fig)

```




# Make the chart more legible by making it a horizontal bar chart and changing the figure size. Also sort the values so the bar chart displays from lowest to highest.

In [12]:
# Make the chart more legible by making it a horizontal bar chart, sorting the values, and changing the figure size.




```py
# Make the chart more legible by making it a horizontal bar chart, sorting the values, and changing the figure size.

gb = df.groupby('Make')
count_makes = gb['Make'].count().sort_values()
count_makes.plot(kind='barh', figsize=(13,21))

```

![image.png](attachment:59fab592-be30-4c9d-9bd6-d28c694d27ce.png)

# Make a timeline line chart in which the x-axis is the year, and the y-axis is the average MSRP.
* What's noticeable about it and what do you think the error is...


In [13]:
# Make a timeline line chart in which the x-axis is the year, and the y-axis is the average MSRP.


```py
df.head(10)

# Make a timeline line chart in which the x-axis is the year, and the y-axis is the average MSRP.


gb = df.groupby('Year')
average_MSRP = gb['MSRP'].mean()
average_MSRP.plot(kind='line', figsize=(8,5))
```
![image.png](attachment:181391d4-98c0-4126-9d06-67eaa6131444.png)


# It seems as though in the years before (and including) 2000, they were counting in tens.
Make a new column called `adjusted_price` that contains all prices; however, for every year up to and including 2000, make it 10x the original MSRP.
   * Hint; you might need to use our old friend `np.where`
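
A minimal sketch of the `np.where` approach on made-up data. It assumes the cutoff includes the year 2000, as the heading suggests; the `toy` frame and the `price_adjuster` name are assumptions for illustration.

```py
import numpy as np
import pandas as pd

# Hypothetical toy data (not from cars.csv).
toy = pd.DataFrame({
    "Year": [1995, 2000, 2001],
    "MSRP": [2000, 2500, 30000],
})

# np.where(condition, value_if_true, value_if_false) works element-wise.
# Assumption: the year 2000 itself is treated as "counted in tens".
price_adjuster = np.where(toy["Year"] <= 2000, 10, 1)

# Multiply the MSRP by the price adjuster.
toy["adjusted_price"] = toy["MSRP"] * price_adjuster

print(toy)
```

Compared with `apply` and a Python function, `np.where` evaluates the whole column at once, which is usually faster on large dataframes.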

In [14]:
# Make a column that is 10 when the year is 2000 or earlier, else 1.


# Multiply the MSRP by the price adjuster.



```py
# Make a column that is 10 when the year is 2000 or earlier, else 1.
def adjust_price_year(year):
    """Return the price multiplier for a given model year."""
    if year <= 2000:
        return 10
    return 1

# Multiply the MSRP by the price adjuster.
df["adjusted_price"] = df["MSRP"] * df["Year"].apply(adjust_price_year)

df.head()

```

# Replot the new adjusted price.  
* Make the y-axis start at 0 and go up to 100,000

In [15]:
# Plot new prices



```py
# Plot new prices, with the y-axis running from 0 to 100,000.

gb = df.groupby('Year')
average_adjusted_price = gb['adjusted_price'].mean()
average_adjusted_price.plot(kind='line', figsize=(8,5), ylim=(0, 100000))




```

![image.png](attachment:23864c0a-1752-4c35-bd14-daee8ab99ed0.png)

# What are the top 5 car makers that make the most expensive cars on average?
* I only want the top 5, make sure your answer is the top 5 and only the top 5. (hint, you can use .head())
* Use our `adjusted_price` column for this
* Hint; you're going to have to do a .groupby to answer this.
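
As a sketch of the groupby pattern, the example below uses made-up data and `nlargest(5)`, which is an alternative to sorting and calling `.head(5)`. The `toy` frame and its values are assumptions for illustration.

```py
import pandas as pd

# Hypothetical toy data (not from cars.csv): six makers, one with two cars.
toy = pd.DataFrame({
    "Make": ["A", "A", "B", "C", "D", "E", "F"],
    "adjusted_price": [10, 30, 50, 40, 5, 60, 25],
})

# Average price per maker, then keep only the five largest averages.
top5 = toy.groupby("Make")["adjusted_price"].mean().nlargest(5)

print(top5)
```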

In [16]:
# What are the top 5 car makers that make the most expensive cars on average?



```py
# What are the top 5 car makers that make the most expensive cars on average?

gb = df.groupby('Make')
average_prices = gb['adjusted_price'].mean().sort_values(ascending=False)
average_prices[:5].plot(kind='barh')

```
![image.png](attachment:52970a32-739c-4ab9-9595-9522f02e6cb8.png)

```py
# What are the top 5 car makers that make the most expensive cars on average?
gb = df.groupby('Make')
expensive_makes = gb['adjusted_price'].mean().sort_values(ascending=False).head(5)
expensive_makes.plot(kind='barh');
```

![image.png](attachment:c00a8b0f-6632-440b-a61a-a13aaa0c14a2.png)

# What are the 5 car makers that have the highest median highway MPG?

In [17]:
# Which car makers have the highest median highway MPG?



```py
# Which car makers have the highest median highway MPG?

gb = df.groupby('Make')
median_mpg = gb['highway MPG'].median().sort_values(ascending=False)
median_mpg[:5].plot(kind='barh')

```

![image.png](attachment:04e7f2b9-c50b-4d36-8ddc-dc7ed05304e7.png)

# Using `sns.histplot`, make a histogram of the adjusted_price of just these car makers.
* ['Chevrolet', 'Ford', 'Toyota']
* Create a temp_df to store the dataframe of just these values.
* Set `hue='Make'`.

In [19]:
# Using `sns.histplot`, make a histogram of the adjusted_price of just these car makers.



```py
import seaborn as sns

sns.set()

# Option 1: build one boolean mask per maker and combine them with OR.
condC = df["Make"] == "Chevrolet"
condF = df["Make"] == "Ford"
condT = df["Make"] == "Toyota"
temp_df = df[condC | condF | condT]

# Option 2: select the same rows with a single `isin` mask.
makers = ['Chevrolet', 'Ford', 'Toyota']
cond = df.Make.isin(makers)
temp_df = df[cond]

# Using `sns.histplot`, make a histogram of the adjusted_price of just these car makers.
ax = sns.histplot(data=temp_df, x='adjusted_price', hue='Make', bins=50)
```

![image.png](attachment:959c09cd-2285-421d-b649-ca2161e8265f.png)

# Remake the same histogram, but limit the x-axis from 0 to 100,000

In [20]:
# Remake the same histogram, but limit the x-axis from 0 to 100,000



```py
# Remake the same histogram, but limit the x-axis from 0 to 100,000

makers = ['Chevrolet', 'Ford', 'Toyota']
temp_df = df[df['Make'].isin(makers)]

# Plot adjusted_price on the x-axis so the limit applies to prices, not years.
ax = sns.histplot(data=temp_df, x='adjusted_price', hue='Make', bins=50)
ax.set_xlim(0, 100000)

```

![image.png](attachment:a17a8e97-ea89-4ee2-8088-702d992f4d2f.png)

# Plot the relationship between Engine HP and highway MPG

In [22]:
# Plot the relationship between Engine HP and highway MPG


```py
# Plot the relationship between Engine HP and highway MPG



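ax = sns.scatterplot(data=df, x='Engine HP', y='highway MPG')
ax.set_title("Relationship of the Engine HP and the Highway MPG")

# Restrict the correlation matrix to numeric columns (needed in pandas 2.0+, where text columns raise an error).
df.corr(numeric_only=True)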
```

![image.png](attachment:0e626609-3a9b-4933-a7b5-34f168da5061.png)

![image.png](attachment:09943812-3dd9-448d-81a5-5ea701ffa31a.png)

# Using `sns.boxplot`, create a boxplot for the 'Engine HP'

In [23]:
# create a boxplot for the 'Engine HP'


```py
# create a boxplot for the 'Engine HP'

ax = sns.boxplot(data=df, x="Engine HP")

```

![image.png](attachment:1aeb0c5e-91e5-4e00-a942-f910ff9d564a.png)

# Make another boxplot for highway MPG

In [24]:
# create a boxplot for the 'highway MPG'


```py
# create a boxplot for the 'highway MPG'

ax = sns.boxplot(data=df, x="highway MPG")

```
![image.png](attachment:66e0cdd2-3845-47e0-b18d-770cd7022b3d.png)


# Remove any outliers from Engine HP and highway MPG

<img src='https://miro.medium.com/max/1400/1*2c21SkzJMf3frPXPAR_gZA.png' width=500>

* Outliers meaning values that are outside 1.5x the Inter Quartile Range (see image above).
* For each column (Engine HP and highway MPG):
* Calculate the 0.25 and 0.75 Quantiles
* Calculate the Inter Quartile Range (IQR)
* Create condition mask for the values that are outliers below (in the 'Minimum' range).
* Create condition mask for the values that are outliers above (in the 'Maximum' range).
* Filter the dataframe to remove any values that are in the above section _OR_ the below section. (Hint: it may be easier to use the inverse selection '~'.)
* Make the same boxplots of Engine HP and Highway MPG as before, but with this filtered dataframe.
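
The steps above can be wrapped in a small helper so they are not repeated for each column. This is a sketch on made-up data; the function name `iqr_outlier_mask`, its parameter `k`, and the `toy` frame are assumptions for illustration.

```py
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    outliers_below = values < (q1 - k * iqr)
    outliers_above = values > (q3 + k * iqr)
    return outliers_below | outliers_above

# Hypothetical toy data (not from cars.csv): 1000 is far above the rest.
toy = pd.DataFrame({"Engine HP": [100, 110, 120, 130, 140, 1000]})

mask = iqr_outlier_mask(toy["Engine HP"])

# '~' inverts the mask, so only the non-outlier rows are kept.
filtered = toy[~mask]

print(filtered)
```

For this toy column Q1 is 112.5 and Q3 is 137.5, so the IQR is 25 and the accepted range is 75 to 175; only the value 1000 is dropped.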

In [25]:
# Remove any outliers from Engine HP and highway MPG


```py
# Remove any outliers from Engine HP and highway MPG

ENGINE = 'Engine HP'

# Calculate Q1 (the 0.25 quantile)
Q1 = df[ENGINE].quantile(0.25)

# Calculate Q3 (the 0.75 quantile)
Q3 = df[ENGINE].quantile(0.75)

# Define the Inter Quartile Range (IQR)
IQR = Q3 - Q1

# Make select condition for the values that fall below Q1 - 1.5*IQR
outliers_below = df[ENGINE] < (Q1 - 1.5 * IQR)

# Make select condition for the values that fall above Q3 + 1.5*IQR
outliers_above = df[ENGINE] > (Q3 + 1.5 * IQR)

# Keep only the rows that are not outliers
df = df[~(outliers_above | outliers_below)]


# Repeat the same steps for highway MPG.
# Note: the quantiles below are computed on the already filtered df.

HIGHWAY = 'highway MPG'

# Calculate Q1 (the 0.25 quantile)
Q1 = df[HIGHWAY].quantile(0.25)

# Calculate Q3 (the 0.75 quantile)
Q3 = df[HIGHWAY].quantile(0.75)

# Define the Inter Quartile Range (IQR)
IQR = Q3 - Q1

# Make select condition for the values that fall below Q1 - 1.5*IQR
outliers_below = df[HIGHWAY] < (Q1 - 1.5 * IQR)

# Make select condition for the values that fall above Q3 + 1.5*IQR
outliers_above = df[HIGHWAY] > (Q3 + 1.5 * IQR)

# Keep only the rows that are not outliers
df = df[~(outliers_above | outliers_below)]


# After removing the outliers.
df.describe()


```

# Remake the boxplots for both Engine HP and highway MPG


In [26]:
# Engine HP boxplot


In [27]:
# highway MPG boxplot



```py
# Engine HP boxplot

ENGINE = 'Engine HP'
ax = sns.boxplot(data=df, x=ENGINE)


# highway MPG boxplot

HIGHWAY = 'highway MPG'

plt.figure()  # start a new figure so the two boxplots are not drawn on the same axes
ax = sns.boxplot(data=df, x=HIGHWAY)

```

# Make a scatter plot of Engine HP vs highway MPG

In [28]:
# Make a scatter plot of Engine HP vs highway MPG


```py
# Make a scatter plot of Engine HP vs highway MPG



ax = sns.scatterplot(data=df, x='Engine HP', y='highway MPG', hue="Engine HP")
ax.set_title("Relationship of the Engine HP and the Highway MPG")


```
![image.png](attachment:ba32a5cb-e6eb-4484-92f7-107e29292056.png)
```py
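# Make a scatter plot of Engine HP vs highway MPG
# Note: `df_hwyMPG` and `df_engHP` are not defined anywhere in this notebook;
# this variant only runs if they already exist as dataframes.
new_df = pd.merge(df_hwyMPG, df_engHP, how='outer')
ax = sns.scatterplot(data=new_df, x='Engine HP', y='highway MPG', hue="Engine HP")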
```


# What does this plot tell you about how Engine HP affects highway MPG?

In [34]:
# What does this plot tell you about how Engine HP affects highway MPG?

print('Your answer here.')

Your answer here.


```py
# What does this plot tell you about how Engine HP affects highway MPG?

print('There is a weak negative correlation between Engine HP and highway MPG')
```

# Using a pairplot, display all of the linear relationships.
* Which variables look like they have the strongest linear relationship (besides MSRP and adjusted_price)?

In [30]:
# Using a pairplot, display all of the linear relationships.



In [31]:
# * Which variables look like they have the strongest linear relationship (Besides MSRP and adjusted_price).



```py
# Using a pairplot, display all of the linear relationships.

sns.pairplot(df)

# * Which variables look like they have the strongest linear relationship (besides MSRP and adjusted_price)?
print("City mpg and highway MPG seem to have the strongest linear relationship")
```

![image.png](attachment:ef2817f4-d9a6-46d2-a7d9-1190b620b946.png)

# Find which features actually have the strongest linear relationship using correlations.
* Make a heatmap plot of all of the correlations in our dataset.
* Change the figure size of our heatmap plot to be 8x8
* __Which feature does Engine HP have the strongest relationship with, and why do you think that relationship exists?__
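
Besides reading the heatmap by eye, the strongest partner of a column can be looked up from the correlation matrix itself. This is a minimal sketch on made-up, all-numeric data; the `toy` frame and its values are assumptions for illustration. On the full dataframe, which also holds text columns, pandas 2.0+ needs `df.corr(numeric_only=True)`.

```py
import pandas as pd

# Hypothetical toy data (not from cars.csv); every column is numeric.
toy = pd.DataFrame({
    "Engine HP": [100, 200, 300, 400, 500],
    "Engine Cylinders": [4, 4, 6, 8, 8],
    "highway MPG": [40, 20, 35, 25, 30],
})

# Pearson correlation between every pair of columns.
corr = toy.corr()

# Take the Engine HP column, drop its correlation with itself (always 1),
# and compare absolute values so strong negative relationships also count.
partners = corr["Engine HP"].drop("Engine HP").abs()
strongest = partners.idxmax()

print(strongest)
```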

In [32]:
# * Make a heatmap plot of all of the correlations in our dataset.
# * Change the figure size of our heatmap plot to be 8x8




In [33]:
# Which feature does Engine HP have the strongest relationship with, and why do you think that relationship exists?

print('Your answer here')

Your answer here


```py
# * Make a heatmap plot of all of the correlations in our dataset.
# * Change the figure size of our heatmap plot to be 8x8

plt.figure(figsize=(8, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')

# Which feature does Engine HP have the strongest relationship with, and why do you think that relationship exists?

print('Engine HP seems to have the strongest relationship with Engine Cylinders. I think the relationship is positive because adding cylinders, the chambers of the engine that generate power, lets the engine produce more power.')

```

![image.png](attachment:e175640a-48cb-4380-8d73-58def06b093a.png)

# [EXTRA CREDIT] 
* In the column names, replace all the spaces with an underscore, and make them all lowercase as well
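
One way to do this without writing a helper function is the vectorized `.str` accessor on the column index. This is a sketch on a made-up, empty dataframe; the `toy` frame is an assumption for illustration.

```py
import pandas as pd

# Hypothetical empty dataframe that only carries a few of the column names.
toy = pd.DataFrame(columns=["Engine HP", "highway MPG", "Driven_Wheels"])

# Lowercase every name, then replace literal spaces with underscores.
toy.columns = toy.columns.str.lower().str.replace(" ", "_", regex=False)

print(list(toy.columns))
```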


In [35]:
# * In the column names, replace all the spaces with an underscore, and make them all lowercase as well



```py
# * In the column names, replace all the spaces with an underscore, and make them all lowercase as well.
def format_name(name):
    return name.lower().replace(" ", "_")

df.columns = df.columns.map(format_name)
```