# Data analysis question
How can we determine the best price at which to sell a car?
- Is there data on the prices of other cars and their characteristics?
- Which features of a car affect its price?


# Getting started with the analysis
- Check every column's data type with `dataframe.dtypes` to see whether it makes sense for that column.
    - e.g. the price of a car should be a float, not an object.
    
    
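- `dataframe.describe(include='all')` gives summary statistics for every column, including object columns.
    - unique: number of distinct values
    - top: the most frequent value
    - freq: how many times the top value appears
    - NaN (Not a Number) in a cell means that statistic does not apply to the column's data type (e.g. `mean` for an object column, or `unique` for a numeric column)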
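
The two checks above can be tried on a small example. This is only a sketch: the `toy` DataFrame and its values are made up for illustration and are not part of the car dataset.

```python
import pandas as pd

# Toy data (made up for illustration); "price" was read in as text.
toy = pd.DataFrame({
    "make": ["audi", "bmw", "audi"],
    "price": ["13950", "16430", "17450"],
})

print(toy.dtypes)  # "price" is stored as strings, not as a number

# Convert the column so that numeric statistics become available.
toy["price"] = toy["price"].astype("float")

# include="all" reports text and numeric columns side by side;
# statistics that do not apply to a column show up as NaN.
summary = toy.describe(include="all")
print(summary)
```

In the summary, `make` has values for `unique`, `top` and `freq` but NaN for `mean`, while `price` is the other way around.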

# Data wrangling
- converting the data from its initial format into a format that is better suited for analysis

In [3]:
import pandas as pd
import numpy as np

In [4]:
filename = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/auto.csv"

headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]

df = pd.read_csv(filename, names = headers)

In [5]:
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


# Evaluate missing data
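- `df.isnull()` returns a boolean DataFrame that is True wherever a value is missing
- `df.notnull()` is the inverse: True wherever a value is present
- at this point the missing values are still encoded as "?", so `isnull()` reports False everywhere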

In [6]:
missing_data = df.isnull()
missing_data.head(5)

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


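# Identify the missing values
- missing values in this dataset are marked with "?", so replace them with NaN to let pandas recognize them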

In [7]:
df.replace("?", np.nan, inplace = True)
df.head(5)

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450
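
Once "?" has been replaced with NaN, the missing values can be counted per column. A minimal sketch on made-up data (the `toy` DataFrame is hypothetical, not the car dataset):

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration) that uses "?" for missing entries.
toy = pd.DataFrame({
    "normalized-losses": ["?", "164", "?"],
    "price": ["13495", "?", "17450"],
})

# Turn the "?" markers into NaN so pandas treats them as missing.
toy = toy.replace("?", np.nan)

# isnull() gives True/False per cell; summing counts the True values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```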


# Deal with missing data
- Drop the whole row or column (dropping a column only makes sense if most of its values are missing)
- Replace the missing values with the mean, the most frequent value, etc.

### Replacing with MEAN

In [8]:
# Count the missing values in the column
null_count = df['normalized-losses'].isnull().sum()
print(null_count)

41


In [9]:
avg_norm_loss=df['normalized-losses'].astype('float').mean(axis=0)
print(avg_norm_loss)

122.0


In [10]:
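# Assign the result back: replace(..., inplace=True) on a selected column may not update df in recent pandas versions
df["normalized-losses"] = df["normalized-losses"].replace(np.nan, avg_norm_loss)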
df.head(5)

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,122.0,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,122.0,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,122.0,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450
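
An alternative way to write the same mean replacement is `pd.to_numeric` followed by `fillna`, which also leaves the column with a numeric dtype. This is a sketch on made-up values (the `toy` DataFrame is hypothetical), not the notebook's own method:

```python
import numpy as np
import pandas as pd

# Toy column (made up for illustration): numbers stored as text, plus gaps.
toy = pd.DataFrame({"normalized-losses": ["164", np.nan, "158", np.nan]})

# pd.to_numeric converts the strings to floats and keeps NaN as NaN.
losses = pd.to_numeric(toy["normalized-losses"])

# mean() skips NaN by default, so it averages only the known values.
toy["normalized-losses"] = losses.fillna(losses.mean())
print(toy)
```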


### Replacing with Frequency
Find the most frequent value and use it to replace the missing values.

In [11]:
df['num-of-doors'].value_counts()

four    114
two      89
Name: num-of-doors, dtype: int64

In [12]:
df['num-of-doors'].value_counts().idxmax()

'four'

In [13]:
# Assign the result back instead of using inplace=True on a selected column
df["num-of-doors"] = df["num-of-doors"].replace(np.nan, "four")
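
To avoid hard-coding the most frequent value, it can be looked up and filled in directly. A minimal sketch on made-up data (the `toy` DataFrame is hypothetical); `mode()` is used here as an alternative to `value_counts().idxmax()`:

```python
import numpy as np
import pandas as pd

# Toy column (made up for illustration) with one missing entry.
toy = pd.DataFrame({"num-of-doors": ["four", "two", np.nan, "four"]})

# mode() ignores NaN and returns the most frequent value(s); take the first.
most_common = toy["num-of-doors"].mode()[0]

# Fill the gaps with that value and assign the result back to the column.
toy["num-of-doors"] = toy["num-of-doors"].fillna(most_common)
print(toy)
```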

### Dropping rows with a missing price

In [None]:
df.dropna(subset=["price"], axis=0, inplace=True)

In [None]:
# reset the index, because rows were dropped
df.reset_index(drop=True, inplace=True)
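
The effect of dropping rows and resetting the index is easiest to see on a tiny example. This is a sketch with made-up rows (the `toy` DataFrame is hypothetical):

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration); the middle row has no price.
toy = pd.DataFrame({
    "make": ["audi", "bmw", "dodge"],
    "price": [13950.0, np.nan, 6377.0],
})

# Drop rows whose "price" is missing, then renumber the index from 0.
# Without reset_index the remaining labels would be 0 and 2.
cleaned = toy.dropna(subset=["price"]).reset_index(drop=True)
print(cleaned)
```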

# Correcting data format
`.dtypes` to check the data types of a DataFrame (`.dtype` for a single column)

`.astype()` to change data type

In [None]:
df[["bore", "stroke"]] = df[["bore", "stroke"]].astype("float")
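
`astype("float")` raises a `ValueError` if a column still contains text that is not a number, such as a leftover "?". One possible alternative is `pd.to_numeric` with `errors="coerce"`, which turns such entries into NaN. A sketch on made-up values (the `toy` DataFrame is hypothetical):

```python
import pandas as pd

# Toy column (made up for illustration) with one non-numeric entry.
toy = pd.DataFrame({"bore": ["3.47", "2.68", "?"]})

# errors="coerce" converts what it can and sets the rest to NaN.
toy["bore"] = pd.to_numeric(toy["bore"], errors="coerce")

print(toy.dtypes)
print(toy)
```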

# Data normalization
Normalizing rescales variables so that statistical or computational methods can compare them fairly. The transformed values fall in a similar range, which makes them easy to compare.
- Simple feature scaling: X_new = X_old / X_max
- Min-max scaling: X_new = (X_old - X_min) / (X_max - X_min)

In [None]:
df['length'] = df['length']/df['length'].max()
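
Both formulas can be compared side by side. A minimal sketch on made-up lengths (the `toy` DataFrame and the new column names are hypothetical):

```python
import pandas as pd

# Toy column (made up for illustration).
toy = pd.DataFrame({"length": [150.0, 175.0, 200.0]})
col = toy["length"]

# Simple feature scaling: divide by the maximum, so the largest value becomes 1.
toy["length-simple"] = col / col.max()

# Min-max scaling: shift by the minimum first, so values span exactly 0 to 1.
toy["length-minmax"] = (col - col.min()) / (col.max() - col.min())

print(toy)
```

With simple feature scaling the smallest value stays above 0 (here 0.75), while min-max scaling maps it to 0.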

# Binning
Binning transforms a numerical variable into categorical bins for grouped analysis.

`np.linspace(start_value, end_value, number_of_dividers)` returns evenly spaced values that can serve as the bin edges.
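
A quick check of what `np.linspace` returns, using made-up end points (0 and 90 are assumptions for illustration only):

```python
import numpy as np

# 4 evenly spaced values from 0 to 90 (both ends included).
# 4 edges enclose 3 intervals of equal width.
edges = np.linspace(0, 90, 4)
print(edges)  # [ 0. 30. 60. 90.]
```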

### Getting rid of NaN values for horsepower

In [None]:
avg_horsepower = df['horsepower'].astype('float').mean(axis=0)
# Assign the result back instead of using inplace=True on a selected column
df['horsepower'] = df['horsepower'].replace(np.nan, avg_horsepower)

In [None]:
df["horsepower"] = df["horsepower"].astype(int)

### Building the bins

To build 3 bins of equal width:
- to include the minimum value of horsepower, set start_value = `min(df["horsepower"])`
- to include the maximum value of horsepower, set end_value = `max(df["horsepower"])`
- since there are 3 bins, there are 4 dividers (bin edges), so 4 is passed as the third argument

In [None]:
bins = np.linspace(min(df["horsepower"]), max(df["horsepower"]), 4)
bins

In [None]:
group_names = ['Low', 'Medium', 'High']

Use `pd.cut` to determine which bin each horsepower value belongs to.

In [None]:
df['horsepower-binned'] = pd.cut(df['horsepower'], bins, labels=group_names, include_lowest=True )
df[['horsepower','horsepower-binned']].head(10)

In [None]:
df["horsepower-binned"].value_counts()
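
The whole binning step can be traced on a handful of values. This is a sketch with made-up horsepower numbers (the `hp` Series is hypothetical), so the assignment of each value to a bin can be checked by hand:

```python
import numpy as np
import pandas as pd

# Toy horsepower values (made up for illustration).
hp = pd.Series([50, 80, 110, 140, 200])

# 4 edges give 3 equal-width bins: 50-100, 100-150, 150-200.
edges = np.linspace(hp.min(), hp.max(), 4)

# include_lowest=True makes the first bin include the minimum value (50).
binned = pd.cut(hp, edges, labels=["Low", "Medium", "High"], include_lowest=True)

counts = binned.value_counts()
print(binned.tolist())
print(counts)
```

Here 50 and 80 land in Low, 110 and 140 in Medium, and 200 in High.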