<h1 id="data_acquisition">Data Acquisition</h1>
<p>
Datasets come in various formats, such as .csv, .json, and .xlsx. They can be stored locally on your machine or online.<br>
In this section, you will learn how to load a dataset into a Jupyter Notebook.<br>
In our case, the Automobile Dataset is hosted online in CSV (comma-separated values) format. Let's use it as an example to practice reading data.
<ul>
    <li>data source: <a href="https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data" target="_blank">https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data</a></li>
    <li>data type: csv</li>
</ul>
The Pandas library is a useful tool for reading various datasets into a data frame. Our Jupyter notebook platform has the <b>Pandas library</b> built in, so all we need to do is import it; no installation is required.
</p>

In [None]:
# import pandas library
import pandas as pd
import numpy as np


<h2>Read Data</h2>
<p>
We use the <code>pandas.read_csv()</code> function to read the CSV file. Inside the parentheses, we put the file path in quotation marks, so that pandas reads the file at that address into a data frame. The file path can be either a URL or a local file path.<br>
Because the data does not include headers, we pass the argument <code>header=None</code> to <code>read_csv()</code>, so that pandas does not treat the first row as a header.<br>
You can assign the resulting data frame to any variable name you choose.
</p>
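<p>
As a small illustration of what <code>header=None</code> changes, the sketch below reads the same text twice. The two-row CSV string is a made-up stand-in for the real file, and the variable names are only for this example.
</p>

```python
import io
import pandas as pd

# Hypothetical two-row CSV with no header line, standing in for the real file
raw = "3,alfa-romero,gas\n1,audi,gas\n"

# Default behaviour: the first data row is consumed as the column names
df_default = pd.read_csv(io.StringIO(raw))

# header=None: every row is kept as data and the columns are numbered from 0
df_no_header = pd.read_csv(io.StringIO(raw), header=None)

print(df_default.shape)
print(df_no_header.shape)
print(list(df_no_header.columns))
```

Without <code>header=None</code>, one row of data is silently lost to the header, which is why the argument matters for this dataset.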

This dataset is hosted on IBM Cloud Object Storage. Click <a href="https://cocl.us/DA101EN_object_storage">HERE</a> for free storage.

In [None]:
# Import pandas library
import pandas as pd

# Read the online file from the URL and assign it to the variable "df"
other_path = "https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/auto.csv"
df = pd.read_csv(other_path, header=None)

In [None]:
df

In [None]:
df.tail()

In [None]:
df.dtypes

In [None]:
df.shape

In [None]:
df.count()

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.isna()

<h3>Add Headers</h3>
<p>
Take a look at our dataset: pandas automatically set the column headers to integers starting from 0.
</p>
<p>
To better describe our data, we can introduce headers. The column names are available at: <a href="https://archive.ics.uci.edu/ml/datasets/Automobile" target="_blank">https://archive.ics.uci.edu/ml/datasets/Automobile</a>
</p>
<p>
Thus, we have to add headers manually.
</p>
<p>
First, we create a list "headers" that includes all column names in order.
Then, we use <code>dataframe.columns = headers</code> to replace the headers with the list we created.
</p>

In [None]:
# create headers list
headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]
print("headers\n", headers)

In [None]:
df.columns=headers

In [None]:
df

## Identify and handle missing values

In [None]:
# convert "?" to NaN

# np.nan is the supported spelling; the np.NaN alias was removed in NumPy 2.0
df1 = df.replace('?', np.nan)
df1

In [None]:
# Count missing values in df1, the frame where "?" has been converted to NaN
df1.isnull().sum()

In [None]:
df["normalized-losses"].value_counts()

In [None]:
df=df1

## Replace "NaN" values with the average of their column

In [None]:
# Convert the column to numeric so the mean is computed from observed values only
df1['normalized-losses'] = pd.to_numeric(df1['normalized-losses'])

In [None]:
# Replace NaN with the column mean (mean() skips NaN by default)
df1['normalized-losses'] = df1['normalized-losses'].fillna(df1['normalized-losses'].mean())

In [None]:
df["normalized-losses"].value_counts()

In [None]:
df1
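
The mean-replacement step can be checked on a tiny example. This is a minimal sketch on a made-up column, not the dataset itself; it uses `pd.to_numeric(..., errors="coerce")` as a shortcut that turns the "?" marker into NaN while converting, and it relies on `Series.mean()` skipping NaN by default.

```python
import pandas as pd

# Hypothetical column read as strings, with "?" marking a missing entry
losses = pd.Series(["100", "?", "200", "150"])

# Convert to numbers; anything that cannot be parsed ("?") becomes NaN
losses = pd.to_numeric(losses, errors="coerce")

# Series.mean() skips NaN, so only the three observed values are averaged
observed_mean = losses.mean()

# Fill the gap with that mean
filled = losses.fillna(observed_mean)

print(observed_mean)
print(filled.tolist())
```

Filling with a placeholder such as 0 before taking the mean would pull the average down, because the placeholder would be counted as a real observation.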

## Replace "NaN" values with the most frequent value in the column

In [None]:
df["num-of-doors"].value_counts()

In [None]:
# "four" is the most frequent value, so use it for the missing entries
df1['num-of-doors'] = df1['num-of-doors'].fillna('four')

In [None]:
df1['num-of-doors'].value_counts()

In [None]:
# Convert "bore" to numeric; missing entries stay NaN
df1['bore'] = pd.to_numeric(df1['bore'])

In [None]:
df1['bore'].value_counts()

In [None]:
# Replace NaN with the mean of the observed values
df1['bore'] = df1['bore'].fillna(df1['bore'].mean())

In [None]:
df1['bore'].value_counts()

In [None]:
df1['bore'].dtypes

In [None]:
# Convert "stroke" to numeric; missing entries stay NaN
df1['stroke'] = pd.to_numeric(df1['stroke'])

In [None]:
df1['stroke'].value_counts()

In [None]:
df1['stroke'].dtypes

In [None]:
# Replace NaN with the mean of the observed values
df1['stroke'] = df1['stroke'].fillna(df1['stroke'].mean())

In [None]:
df1['stroke'].value_counts()

In [None]:
df1.dtypes

In [None]:
# Convert "horsepower" to numeric; missing entries stay NaN
df1['horsepower'] = pd.to_numeric(df1['horsepower'])

In [None]:
df1['horsepower'].value_counts()

In [None]:
# Replace NaN with the mean of the observed values
df1['horsepower'] = df1['horsepower'].fillna(df1['horsepower'].mean())

In [None]:
df1['horsepower'].dtypes

In [None]:
df1.dtypes

In [None]:
# Convert "peak-rpm" to numeric; missing entries stay NaN
df1['peak-rpm'] = pd.to_numeric(df1['peak-rpm'])

In [None]:
# Replace NaN with the mean of the observed values
df1['peak-rpm'] = df1['peak-rpm'].fillna(df1['peak-rpm'].mean())

In [None]:
df1['peak-rpm'].dtypes

In [None]:
# Convert "price" to numeric; missing entries stay NaN
df1['price'] = pd.to_numeric(df1['price'])

In [None]:
df1['price'].value_counts()

In [None]:
# Replace NaN with the mean of the observed values
df1['price'] = df1['price'].fillna(df1['price'].mean())

In [None]:
df1['price'].value_counts()

In [None]:
df1.dtypes


In [None]:
df1['num-of-cylinders'].value_counts()

In [None]:
df1.head()

In [None]:
df1

In [None]:
# Convert miles per gallon (mpg) into liters per 100 km using the unit-conversion formula
# L/100km = 235 / mpg

df1['city-l/100km'] = 235/df1['city-mpg']
df1['highway-l/100km'] = 235/df1['highway-mpg']
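
The factor 235 comes from the unit definitions: 100 km is about 62.137 miles and one US gallon is about 3.785 liters, so L/100km ≈ (3.785 × 62.137) / mpg ≈ 235.2 / mpg. A minimal sketch of the conversion on made-up values follows; the helper name `mpg_to_l_per_100km` is hypothetical and is not part of the notebook.

```python
import pandas as pd

def mpg_to_l_per_100km(mpg):
    """Convert miles per US gallon to liters per 100 km (rounded factor 235)."""
    return 235 / mpg

# Hypothetical fuel-economy values, not taken from the dataset
mpg = pd.Series([21, 27, 47])
converted = mpg_to_l_per_100km(mpg)

print(converted.round(2).tolist())
```

Note that the relationship is inverse: a higher mpg means a lower L/100km, so the sign of any correlation with price flips after the conversion.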


In [None]:
df1[['city-l/100km', 'city-mpg', 'highway-l/100km', 'highway-mpg']].head()

In [None]:
df1.head()

In [None]:
df1[['length','width', 'height']].head()

In [None]:
# Scale each numeric column by its maximum so the columns share a uniform range (largest value = 1)
 
# check how many numerical columns
 
df_num=df1.select_dtypes(include=np.number)/df1.select_dtypes(include=np.number).max()
df_num
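
Dividing by the column maximum is one of several scaling choices. The sketch below, on a made-up two-column frame, compares it with min-max scaling, which maps each column onto exactly the range 0 to 1. The frame and variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical numeric columns, not taken from the dataset
toy = pd.DataFrame({"length": [150.0, 175.0, 200.0],
                    "width": [60.0, 66.0, 72.0]})

# Simple feature scaling: divide by the column maximum (largest value becomes 1)
simple_scaled = toy / toy.max()

# Min-max scaling: smallest value becomes 0 and largest becomes 1
min_max_scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(simple_scaled)
print(min_max_scaled)
```

With simple feature scaling, a column containing negative values (such as a risk rating centered on zero) does not end up in the range 0 to 1, which is one reason to consider min-max scaling instead.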

In [None]:
df1

In [None]:
# Binning

df1['horsepower']=df1['horsepower'].astype(float, copy=True)

In [None]:
# Three equal-width bins span the full horsepower range
bin_width = (df1['horsepower'].max() - df1['horsepower'].min()) / 3
bin_width

In [None]:
# Three bins need four edges; np.linspace includes the maximum, so no value is left out
bins = np.linspace(df1['horsepower'].min(), df1['horsepower'].max(), 4)
bins

In [None]:
group_name = ['low-horsepower', 'medium-horsepower', 'high-horsepower']

In [None]:
df1['horsepower-bin'] = pd.cut(df1['horsepower'], bins, labels=group_name, include_lowest=True)

df1[['horsepower', 'horsepower-bin']].head(20)
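
To see how equal-width binning assigns labels, here is a minimal sketch on a handful of made-up horsepower values. It assumes three bins, so it builds four edges with `np.linspace`, and the short labels are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical horsepower values, not taken from the dataset
hp = pd.Series([48.0, 70.0, 111.0, 160.0, 262.0])

# Four evenly spaced edges define three equal-width bins over the full range
edges = np.linspace(hp.min(), hp.max(), 4)
labels = ["low", "medium", "high"]

# include_lowest=True keeps the minimum value inside the first bin
binned = pd.cut(hp, edges, labels=labels, include_lowest=True)

print(edges)
print(binned.tolist())
```

If the last edge stopped short of the maximum, the largest values would fall outside every bin and come back as NaN, so it is worth checking `binned.isna().sum()` after cutting.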

## Bins Visualization

In [None]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

In [None]:
mpl.style.use(['ggplot'])

total_cat =3

# Get the x-tick values

count, bin_edges = np.histogram(df1['horsepower'], 3)

fig, ax = plt.subplots(figsize=(12,4))
N, bins, patches = ax.hist(df1['horsepower'], bins=total_cat*10, edgecolor='white', linewidth=1)

for i in range(0, 10):
  patches[i].set_facecolor('r')
for i in range(10, 20):
  patches[i].set_facecolor('b')
for i in range(20, 30):
  patches[i].set_facecolor('g')
    
ax.set_xticks(bin_edges)

ax.set_title('Horsepower bin')
ax.set_ylabel('Count')
ax.set_xlabel('Horsepower')

red_patch = mpatches.Patch(color='red', label="Low-Horsepower")
blue_patch = mpatches.Patch(color='blue', label="Medium-Horsepower")
green_patch = mpatches.Patch(color='green', label="High-Horsepower")

plt.legend(handles=[red_patch, blue_patch, green_patch])

plt.show()


In [None]:
# convert fuel type into indicator variable


df1.columns

In [None]:
dummy_var  = pd.get_dummies(df1['fuel-type'])
dummy_var

In [None]:
dummy_var.rename(columns={'gas' :'fuel-type-gas', 'diesel':'fuel-type-diesel'}, inplace=True)
dummy_var.head(10)

In [None]:
df1 = pd.concat([df1, dummy_var], axis=1)

In [None]:
df1

In [None]:
df1.drop('fuel-type', axis = 1, inplace=True)

In [None]:
df1.head(20)
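
The rename step can also be folded into `pd.get_dummies` through its `prefix` and `prefix_sep` parameters. The sketch below shows this on a made-up three-row frame; `dtype=int` is used here so the indicators print as 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical categorical column, not taken from the dataset
toy = pd.DataFrame({"fuel-type": ["gas", "diesel", "gas"]})

# One indicator column per category, named "<prefix><sep><category>"
dummies = pd.get_dummies(toy["fuel-type"], prefix="fuel-type",
                         prefix_sep="-", dtype=int)

# Swap the original column for its indicator columns
toy = pd.concat([toy.drop(columns="fuel-type"), dummies], axis=1)

print(toy)
```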

In [None]:
# convert Aspiration into indicator variable


df1['aspiration'].value_counts()

In [None]:
dummy_var1  = pd.get_dummies(df1['aspiration'])
dummy_var1

In [None]:
dummy_var1.rename(columns={'std' :'aspiration_std', 'turbo':'aspiration_turbo'}, inplace=True)
dummy_var1.head(10)

In [None]:
df1 = pd.concat([df1, dummy_var1], axis=1)

In [None]:
df1

In [None]:
df1.drop('aspiration', axis = 1, inplace=True)

In [None]:
df1.head()

In [None]:
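# Restrict to numeric columns; recent pandas versions raise an error on text columns
df1.corr(numeric_only=True)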

In [None]:
df1[['bore', 'stroke', 'compression-ratio', 'horsepower']].corr()

In [None]:
import seaborn as sns

plt.figure(figsize=(7,7))
sns.scatterplot(x=df1['engine-size'], y=df1['price'])
plt.show()

In [None]:
import seaborn as sns

plt.figure(figsize=(7,7))
sns.scatterplot(x=df1['highway-mpg'], y=df1['price'])
plt.show()

In [None]:
import seaborn as sns

plt.figure(figsize=(7,7))
sns.scatterplot(x=df1['peak-rpm'], y=df1['price'])
plt.show()

In [None]:
import seaborn as sns

plt.figure(figsize=(7,7))
sns.scatterplot(x=df1['stroke'], y=df1['price'])
plt.show()

In [None]:
fig = plt.figure(figsize=(20,8))   

ax0 = fig.add_subplot(2, 2, 1)
ax1 = fig.add_subplot(2, 2, 2)
ax2 = fig.add_subplot(2, 2, 3)
ax3 = fig.add_subplot(2, 2, 4)

sns.set(font_scale= 1.5)

# subplot 1:

sns.regplot(x='engine-size', y='price', data=df1, color="green", marker="+",  scatter_kws={'s':50}, ax=ax0)
ax0.set_title('price vs engine-size: strong positive')

# subplot 2:

sns.regplot(x="highway-mpg", y='price', data=df1, color='red', marker="*", scatter_kws={"s" :50}, ax=ax1)
ax1.set_title("price vs highway-mpg : strong negative")

# subplot 3:
 
sns.regplot(x="peak-rpm", y='price', data=df1, color="blue", marker="*", scatter_kws={"s": 50} , ax=ax2)
ax2.set_title("price vs peak-rpm : weak")

# subplot 4:

sns.regplot(x="stroke", y='price', data=df1, color="red", marker="+", scatter_kws={"s": 50} , ax=ax3)
ax3.set_title("price vs stroke : weak")

fig.tight_layout()

plt.show()




In [None]:
# Relation between body-style and price

plt.figure(figsize=(12,4))
sns.boxplot(x="body-style", y='price', data=df1)

In [None]:
# Relation between engine-location and price

plt.figure(figsize=(12,4))
sns.boxplot(x="engine-location", y='price', data=df1)

In [None]:
# Relation between drive-wheels and price

plt.figure(figsize=(12,4))
sns.boxplot(x="drive-wheels", y='price', data=df1)

## Groups Analysis

In [None]:
df1['body-style'].unique()

In [None]:
df_group = df1[['body-style', 'price']]
df_group_result = df_group.groupby(['body-style'], as_index= False).mean()
df_group_result

In [None]:
df1['drive-wheels'].unique()

In [None]:
df_group1 = df1[['drive-wheels', 'price']]
df_group1_result = df_group1.groupby(['drive-wheels'], as_index= False).mean()
df_group1_result

In [None]:
df_group = df1[['body-style','drive-wheels' ,'price']]
df_group_result = df_group.groupby(['body-style','drive-wheels'], as_index= False).mean()
df_group_result

## Pearson Correlation coefficient Analysis

In [None]:

from scipy import stats

pearson_coef, p_value = stats.pearsonr(df1['wheel-base'], df1['price'])
print("The pearson coefficient for wheel-base vs price is", pearson_coef, "with a P-value of P =", p_value)

pearson_coef, p_value = stats.pearsonr(df1['horsepower'], df1['price'])
print("The pearson coefficient for horsepower vs price is", pearson_coef, "with a P-value of P =", p_value)

pearson_coef, p_value = stats.pearsonr(df1['length'], df1['price'])
print("The pearson coefficient for length vs price is ", pearson_coef, "with a P-value of P =", p_value)

pearson_coef, p_value = stats.pearsonr(df1['width'], df1['price'])
print("The pearson coefficient for width vs price is ", pearson_coef, "with a p-value of p =", p_value)

pearson_coef, p_value = stats.pearsonr(df1['curb-weight'], df1['price'])
print("The pearson coefficient for curb-weight vs price is ", pearson_coef, "with a p-value of p =", p_value)

pearson_coef, p_value = stats.pearsonr(df1['engine-size'], df1['price'])
print("The pearson  coefficient for engine-size vs price is " ,  pearson_coef, "with a p-value of p =", p_value)

pearson_coef, p_value = stats.pearsonr(df1['bore'], df1['price'])
print("The pearson coefficient for bore vs price is ", pearson_coef, "with a p-value of p =", p_value)

pearson_coef, p_value = stats.pearsonr(df1['city-mpg'], df1['price'])
print("The pearson coefficient for city-mpg vs price is ", pearson_coef, "with a p-value of p =", p_value)

pearson_coef, p_value = stats.pearsonr(df1['highway-mpg'], df1['price'])
print("The pearson coefficient for highway-mpg vs price is ", pearson_coef, "with a p-value of p =",  p_value)
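
For reference, the coefficient reported above is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it by hand with NumPy on made-up values and compares the result with `np.corrcoef`; the helper name `pearson_r` is hypothetical, and the p-value is not reproduced here.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: sum of co-deviations over the root of the squared deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_dev = x - x.mean()
    y_dev = y - y.mean()
    return (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# Hypothetical values in which price is an exact linear function of engine size
engine_size = [90, 110, 130, 150]
price = [7000, 9000, 11000, 13000]

r = pearson_r(engine_size, price)
print(r)
print(np.corrcoef(engine_size, price)[0, 1])
```

A coefficient near +1 or -1 indicates a strong linear relationship, and a value near 0 indicates a weak one; the p-value printed by `stats.pearsonr` indicates how likely such a coefficient would be if there were no underlying correlation.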



## Model development and evaluation

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.model_selection  import GridSearchCV
from sklearn.preprocessing import StandardScaler

In [None]:
df_new = df1[['horsepower', "curb-weight", 'engine-size', 'highway-mpg', 'bore', 'wheel-base',
              'city-mpg', 'length', 'width']]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_new, df1['price'], test_size=0.20, random_state=1)
print("number of test samples :", X_test.shape[0])
print("number of training samples :", X_train.shape[0] )

In [None]:
# Multiple Linear Regression

lm = LinearRegression()
lm.fit(X_train, y_train)

In [None]:
print("The R-squared value for Multiple Linear Regression Model is :", lm.score(X_test, y_test))

In [None]:
# Random Forest Regression

rm = RandomForestRegressor()
rm.fit(X_train, y_train)

In [None]:
print("The R-squared value for Random Forest Regression Model is :", rm.score(X_test, y_test))

In [None]:
## KNN Regression

from sklearn.neighbors import KNeighborsRegressor

In [None]:
knr=KNeighborsRegressor()
knr.fit(X_train,y_train)

In [None]:
print("The R-squared value for knn Model is :", knr.score(X_test, y_test))

In [None]:
## SVM Regression

from sklearn import svm

In [None]:
sv=svm.SVR()
sv.fit(X_train,y_train)

In [None]:
print("The R-squared value for svm Model is :", sv.score(X_test, y_test))

In [None]:
## Decision Tree Regressor

from sklearn.tree import DecisionTreeRegressor

In [None]:
dt=DecisionTreeRegressor()
dt.fit(X_train,y_train)

In [None]:
print("The R-squared value for Decision tree Model is :", dt.score(X_test, y_test))
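
`Pipeline` and `StandardScaler` are imported above but not used. Distance-based models such as KNN and SVR are sensitive to feature scale, so one possible next step is to standardize the features inside a pipeline. The sketch below illustrates the idea on synthetic data (an assumption made so the block runs on its own); it is not a result for the automobile dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two features on very different scales,
# with the target driven only by the small-scale feature
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
y = 3.0 * X[:, 0] + rng.normal(0, 0.1, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1)

# The scaler is fitted on the training split only, then applied before KNN
scaled_knn = Pipeline([("scale", StandardScaler()),
                       ("knn", KNeighborsRegressor())])
scaled_knn.fit(X_train, y_train)
scaled_score = scaled_knn.score(X_test, y_test)

# Same model without scaling, for comparison
plain_knn = KNeighborsRegressor().fit(X_train, y_train)
plain_score = plain_knn.score(X_test, y_test)

print("R-squared with scaling   :", scaled_score)
print("R-squared without scaling:", plain_score)
```

Wrapping the scaler in the pipeline keeps test data out of the fitting step, and the same pipeline object can be passed to `GridSearchCV` if hyperparameter tuning is added later.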