## Alejo Vinluan (abv210001)

# Machine Learning with SKLearn
The purpose of this Jupyter notebook is to gain experience using sklearn on a small dataset.

## Dataset Breakdown
The dataset gives the following columns:
* mpg - The average gas mileage of the vehicle

* cylinders - The number of cylinders the car has

* displacement - The engine size

* horsepower - The horsepower of the vehicle

* weight - The weight of the vehicle in pounds

* acceleration - The acceleration figure for the vehicle (the dataset does not state the unit)

* year - The model year of the vehicle, stored as two digits (70 means 1970)

* origin - The origin of the car, encoded as an integer class label

* name - The make and model of the car


## Read the Data
This section will use pandas to read the data, output the first few rows, and output the dimensions of the data.

In [2]:
import pandas as pd

# Import the data from the folder
data = pd.read_csv('data/Auto.csv')

# Output the first few rows
print("Head of Data Frame:")
print(data.head())

# Output the dimensions of the data
print("Number of Rows:", data.shape[0])
print("Number of Columns:", data.shape[1])

Head of Data Frame:
    mpg  cylinders  displacement  horsepower  weight  acceleration  year  \
0  18.0          8         307.0         130    3504          12.0  70.0   
1  15.0          8         350.0         165    3693          11.5  70.0   
2  18.0          8         318.0         150    3436          11.0  70.0   
3  16.0          8         304.0         150    3433          12.0  70.0   
4  17.0          8         302.0         140    3449           NaN  70.0   

   origin                       name  
0       1  chevrolet chevelle malibu  
1       1          buick skylark 320  
2       1         plymouth satellite  
3       1              amc rebel sst  
4       1                ford torino  
Number of Rows: 392
Number of Columns: 9
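
The head above already shows a `NaN` in the acceleration column, so it can help to count missing values before exploring further. The following is a minimal sketch, assuming a hypothetical `sample` frame rebuilt by hand from the five rows printed above (without the name column) rather than the full `data/Auto.csv` file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data/Auto.csv: the five rows shown in the head above
sample = pd.DataFrame({
    "mpg": [18.0, 15.0, 18.0, 16.0, 17.0],
    "cylinders": [8, 8, 8, 8, 8],
    "displacement": [307.0, 350.0, 318.0, 304.0, 302.0],
    "horsepower": [130, 165, 150, 150, 140],
    "weight": [3504, 3693, 3436, 3433, 3449],
    "acceleration": [12.0, 11.5, 11.0, 12.0, np.nan],
    "year": [70.0] * 5,
    "origin": [1] * 5,
})

# Count the missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Count the rows that contain at least one missing value
rows_with_missing = int(sample.isna().any(axis=1).sum())
print("Rows with at least one NA:", rows_with_missing)
```

Running the same two calls on the full `data` frame would show which columns contain NA values.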


## Data Exploration
This section will describe the mpg, weight, and year columns.

### MPG Description
MPG stands for "miles per gallon": the number of miles a vehicle can travel on one gallon of fuel.

In [3]:
print("MPG Description")
print(data['mpg'].describe())

MPG Description
count    392.000000
mean      23.445918
std        7.805007
min        9.000000
25%       17.000000
50%       22.750000
75%       29.000000
max       46.600000
Name: mpg, dtype: float64


The vehicles in the dataset average 23.45 mpg. The least fuel-efficient vehicle gets 9 mpg, while the most fuel-efficient gets 46.6 mpg.

### Weight Description
This column is the vehicle's curb weight, measured in pounds.

In [5]:
print("Weight Description")
print(data['weight'].describe())

Weight Description
count     392.000000
mean     2977.584184
std       849.402560
min      1613.000000
25%      2225.250000
50%      2803.500000
75%      3614.750000
max      5140.000000
Name: weight, dtype: float64


According to the description returned, we find that:

* The average weight of a vehicle is 2977.58 lbs

* The lightest vehicle in the dataset is 1613 lbs

* The heaviest vehicle in the dataset is 5140 lbs


### Year Description
This column is the model year of the vehicle, stored as a two-digit value (for example, 70 means 1970).

In [6]:
print("Year Description")
print(data['year'].describe())

Year Description
count    390.000000
mean      76.010256
std        3.668093
min       70.000000
25%       73.000000
50%       76.000000
75%       79.000000
max       82.000000
Name: year, dtype: float64


According to the year column of the dataset, we find that:

* The average year of the vehicles in this dataset is 1976

* The oldest car in the dataset is from 1970

* The newest car in the dataset is from 1982
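
The interpretation above reads the two-digit values as years in the 1900s. A minimal sketch of that conversion follows, assuming a hypothetical `years` Series that holds the quartile values from the summary rather than the real column.

```python
import pandas as pd

# Hypothetical sample of two-digit model years taken from the summary above
years = pd.Series([70.0, 73.0, 76.0, 79.0, 82.0], name="year")

# Assumption: every value refers to the 1900s, so adding 1900 gives the calendar year.
# Missing values would need to be dropped or filled before the cast to int.
full_years = (years + 1900).astype(int)
print(full_years.tolist())
```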

## Explore Data Types
This section will check the datatypes of all columns, change both the origin and cylinders columns to categorical, and verify the changes utilizing the dtypes attribute.

In [14]:
# Print the types
print("Auto Dataset Types:")
print(data.dtypes)

# Change the 'origin' and 'cylinders' columns to categorical
data['origin'] = pd.Categorical(data['origin'])
data['cylinders'] = data['cylinders'].astype('category')

# Verify the changes were completed
print("\nAfter changing origin and cylinder to categorical")
print("Origin column type:", data['origin'].dtypes)
print("Cylinders column type:", data['cylinders'].dtypes)

Auto Dataset Types:
mpg              float64
cylinders       category
displacement     float64
horsepower         int64
weight             int64
acceleration     float64
year             float64
origin          category
name              object
dtype: object

After changing origin and cylinder to categorical
Origin column type: category
Cylinders column type: category
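
A categorical column and its integer codes are easy to confuse, so the difference is sketched below. The `cylinders` Series here is hypothetical, with values chosen for illustration rather than read from the dataset.

```python
import pandas as pd

# Hypothetical cylinder counts; the values are illustrative, not read from Auto.csv
cylinders = pd.Series([8, 4, 6, 4, 8], name="cylinders")

# astype('category') keeps the original labels and reports dtype 'category'
as_category = cylinders.astype("category")
print(as_category.dtype)
print(as_category.tolist())

# .cat.codes replaces each label with its position among the sorted categories
# and returns a plain integer Series, not a categorical one
as_codes = as_category.cat.codes
print(as_codes.dtype)
print(as_codes.tolist())
```

In this sketch the labels 4, 6, and 8 map to the codes 0, 1, and 2. A column built from `.cat.codes` would therefore print an integer dtype instead of `category`.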


## Delete rows with NA Values
This section will delete the rows with NA values and output the new dimensions of the dataset.

In [15]:
# Drop the rows with NA
data = data.dropna()

print("Shape of Dataset After Dropping NA")
print("Number of Rows:", data.shape[0])
print("Number of Columns:", data.shape[1])

Shape of Dataset After Dropping NA
Number of Rows: 389
Number of Columns: 9


After dropping the rows with NA values, the dataset shrank from 392 rows to 389, so 3 rows were removed.
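
The row count before and after `dropna()` can also be compared in code. This is a minimal sketch on a hypothetical four-row `frame`, not the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with one missing value in each of two columns
frame = pd.DataFrame({
    "acceleration": [12.0, 11.5, np.nan, 12.0],
    "year": [70.0, np.nan, 70.0, 70.0],
})

rows_before = frame.shape[0]
cleaned = frame.dropna()  # drops every row that contains at least one NA
rows_after = cleaned.shape[0]

print("Rows before:", rows_before)
print("Rows after:", rows_after)
print("Rows dropped:", rows_before - rows_after)

# dropna() keeps the original index labels, so gaps appear where rows were removed
print("Remaining index labels:", cleaned.index.tolist())
```

Because the index is not renumbered, a later `head()` on the cleaned data can skip labels, as seen where the printed rows jump from 3 to 6.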

## Modify the Columns
This section will create a new column named mpg_high. The column is categorical: it is 1 if the vehicle's mpg is higher than the dataset average and 0 if it is equal to or lower than the average.

In [20]:
# Get the average mpg
avg_mpg = data['mpg'].mean()
print("Average MPG:", avg_mpg)

# Create the mpg_high column
data['mpg_high'] = data.apply(lambda row: 1 if row.mpg > avg_mpg else 0, axis=1)

# Change type of mpg_high to categorical
data['mpg_high'] = pd.Categorical(data['mpg_high'])

# Print head of the dataset
print("\nData head to show new mpg_high column")
print(data.head())

Average MPG: 23.490488431876607

Data head to show new mpg_high column
    mpg cylinders  displacement  horsepower  weight  acceleration  year  \
0  18.0         8         307.0         130    3504          12.0  70.0   
1  15.0         8         350.0         165    3693          11.5  70.0   
2  18.0         8         318.0         150    3436          11.0  70.0   
3  16.0         8         304.0         150    3433          12.0  70.0   
6  14.0         8         454.0         220    4354           9.0  70.0   

  origin                       name mpg_high  
0      1  chevrolet chevelle malibu        0  
1      1          buick skylark 320        0  
2      1         plymouth satellite        0  
3      1              amc rebel sst        0  
6      1           chevrolet impala        0  
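
The row-wise `apply` above works, but the same column can be built with a vectorized comparison, which is usually faster on larger frames. The sketch below uses a hypothetical `frame` with made-up mpg values and checks that both approaches agree.

```python
import pandas as pd

# Hypothetical mpg values chosen for illustration
frame = pd.DataFrame({"mpg": [18.0, 15.0, 30.0, 36.0, 14.0]})
avg_mpg = frame["mpg"].mean()

# Row-wise version, as in the cell above
row_wise = frame.apply(lambda row: 1 if row.mpg > avg_mpg else 0, axis=1)

# Vectorized version: compare the whole column at once, then cast booleans to 0/1
vectorized = (frame["mpg"] > avg_mpg).astype(int)

print(vectorized.tolist())
print("Same result:", bool((row_wise == vectorized).all()))
```

Either result can then be passed to `pd.Categorical`, as in the cell above.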
