---
# Cleaning the Dataset
---

### Dataset Cleaning Process

This notebook demonstrates the data cleaning process for the `mtcars2.csv` dataset, which contains information about various car models and their specifications.

#### Data Loading and Initial Processing
- The dataset was loaded with pandas into a DataFrame called `cars`
- The unnamed column holding the car names (read in as 'Unnamed: 1') was renamed to 'model' to better reflect its content
- The 'S.No' column was removed because it only repeats the row number and is not needed for analysis
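
The loading steps above can be sketched on a small in-memory stand-in for `mtcars2.csv`. The raw header layout (a blank header cell above the car names) and the four-column sample are assumptions made for illustration. The sketch shows that pandas labels a blank header cell `Unnamed: <position>` and that `rename` skips any key that does not match an existing label exactly.

```python
import io
import pandas as pd

# Hypothetical stand-in for mtcars2.csv: the second header cell is blank,
# so pandas labels that column 'Unnamed: 1'.
raw = io.StringIO(
    "S.No,,mpg,qsec\n"
    "1,Mazda RX4,21.0,16.46\n"
    "2,Mazda RX4 Wag,21.0,17.02\n"
    "3,Datsun 710,22.8,18.61\n"
)
cars = pd.read_csv(raw)

# A key that does not match exactly (note the stray space) is ignored.
unchanged = cars.rename(columns={"Unnamed : 1": "model"})

# The exact label renames the column; 'S.No' is then dropped.
cars = cars.rename(columns={"Unnamed: 1": "model"}).drop(columns=["S.No"])
print(list(cars.columns))
```

Passing `errors="raise"` to `rename` makes a mismatched key fail loudly instead of being skipped.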

#### Data Cleaning Operations
- The 'qsec' column (quarter-mile time) was converted to numeric format using `pd.to_numeric()` with `errors='coerce'`, which turns any non-numeric entries into NaN
- Missing values in 'qsec' were then filled with the mean of that column
- The 'mpg' (miles per gallon) column was cast to float so that it is treated as a numeric column

#### Data Analysis
- A correlation matrix was computed to show the pairwise linear relationships between the numerical variables
- This correlation matrix (stored in `df`) helps identify which features are strongly correlated with each other
- For example, 'mpg' has strong negative correlations with both 'cyl' (cylinders, about -0.85) and 'wt' (weight, about -0.87)
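
One way to read such a matrix is to pull out a single column and sort it. The sketch below does this on four rows copied from the preview in this notebook; with so few rows the coefficients will differ from the full-data values, so only the signs are meaningful here. The names `sample`, `corr`, and `mpg_corr` are illustrative.

```python
import pandas as pd

# Four rows taken from the preview shown in this notebook
# (Mazda RX4, Datsun 710, Hornet Sportabout, Duster 360).
sample = pd.DataFrame({
    "mpg": [21.0, 22.8, 18.7, 14.3],
    "cyl": [6, 4, 8, 8],
    "wt": [2.62, 2.32, 3.44, 3.57],
})

corr = sample.corr()  # Pearson correlation by default

# Correlations of every other column with mpg, most negative first.
mpg_corr = corr["mpg"].drop("mpg").sort_values()
print(mpg_corr)
```

The matrix is symmetric with ones on the diagonal, so each pair only needs to be read once.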

#### Dataset Structure
- The dataset contains 32 car models described by 12 columns:
    - Model name (read in as 'Unnamed: 1', which is the label shown in the recorded outputs)
    - Performance metrics (mpg, qsec)
    - Engine specifications (cyl, disp, hp, carb)
    - Physical attributes (wt)
    - Technical specifications (drat, vs, am, gear)
- All columns except the model name are numerical, with a mix of float and integer data types
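
A structure check like the one summarized above could be scripted roughly as follows. The three-row frame is a hypothetical stand-in, not the real dataset; on the full data the same calls would report the row count, the column count, and which columns are numeric.

```python
import pandas as pd

# Hypothetical three-row stand-in with one text column and two numeric columns.
cars = pd.DataFrame({
    "model": ["Mazda RX4", "Datsun 710", "Valiant"],
    "mpg": [21.0, 22.8, 18.1],
    "cyl": [6, 4, 6],
})

n_rows, n_cols = cars.shape

# Split the columns by whether pandas treats them as numeric.
numeric_cols = cars.select_dtypes(include="number").columns.tolist()
text_cols = cars.select_dtypes(exclude="number").columns.tolist()
print(n_rows, n_cols, numeric_cols, text_cols)
```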

This cleaning process prepares the dataset for further analysis or modeling by filling missing values, enforcing numeric data types, and removing unnecessary columns.

---

In [None]:
import pandas as pd # importing pandas

In [None]:
cars = pd.read_csv("mtcars2.csv")         # reading the csv file
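# Rename the unnamed car-name column. The key must match the existing label
# exactly ('Unnamed: 1', no space before the colon); rename() ignores unknown keys.
cars = cars.rename(columns={'Unnamed: 1': 'model'})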
cars

Unnamed: 0,S.No,Unnamed: 1,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,1,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,2,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,3,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,4,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,5,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2
5,6,Valiant,18.1,6,225.0,105,2.76,3.46,,1,0,3,1
6,7,Duster 360,14.3,8,360.0,245,3.21,3.57,15.84,0,0,3,4
7,8,Merc 240D,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2
8,9,Merc 230,22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
9,10,Merc 280,19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4


In [None]:
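# Fill missing values in qsec with the column mean. Note the call: mean(), not
# mean. Without the parentheses the bound method itself is used as the fill value.
cars.qsec = cars.qsec.fillna(cars.qsec.mean())
cars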

Unnamed: 0,S.No,Unnamed: 1,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,1,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,2,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,3,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,4,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,5,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2
5,6,Valiant,18.1,6,225.0,105,2.76,3.46,<bound method Series.mean of 0 16.46\n1 ...,1,0,3,1
6,7,Duster 360,14.3,8,360.0,245,3.21,3.57,15.84,0,0,3,4
7,8,Merc 240D,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2
8,9,Merc 230,22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
9,10,Merc 280,19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4


In [None]:
cars = cars.drop(columns=["S.No"])  # drop the serial-number column, which only repeats the row number
cars

Unnamed: 0,Unnamed: 1,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2
5,Valiant,18.1,6,225.0,105,2.76,3.46,<bound method Series.mean of 0 16.46\n1 ...,1,0,3,1
6,Duster 360,14.3,8,360.0,245,3.21,3.57,15.84,0,0,3,4
7,Merc 240D,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2
8,Merc 230,22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
9,Merc 280,19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4


In [None]:
# Coerce 'qsec' to numeric; any non-numeric entries become NaN
cars.qsec = pd.to_numeric(cars.qsec, errors='coerce')
# Fill missing values in 'qsec' with the mean of the column
cars.qsec = cars.qsec.fillna(cars.qsec.mean())

# Then calculate the correlation matrix
df = cars[['mpg','cyl','disp','hp','drat','wt','qsec','vs','am','gear','carb']].corr()
df

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
mpg,1.0,-0.852162,-0.847551,-0.776168,0.681172,-0.867659,0.360351,0.664039,0.599832,0.480285,-0.550925
cyl,-0.852162,1.0,0.902033,0.832447,-0.699938,0.782496,-0.548775,-0.810812,-0.522607,-0.492687,0.526988
disp,-0.847551,0.902033,1.0,0.790949,-0.710214,0.88798,-0.385207,-0.710416,-0.591227,-0.555569,0.394977
hp,-0.776168,0.832447,0.790949,1.0,-0.448759,0.658748,-0.650674,-0.723097,-0.243204,-0.125704,0.749812
drat,0.681172,-0.699938,-0.710214,-0.448759,1.0,-0.712441,0.120175,0.440278,0.712711,0.69961,-0.09079
wt,-0.867659,0.782496,0.88798,0.658748,-0.712441,1.0,-0.130362,-0.554916,-0.692495,-0.583287,0.427606
qsec,0.360351,-0.548775,-0.385207,-0.650674,0.120175,-0.130362,1.0,0.667873,-0.271763,-0.203784,-0.573987
vs,0.664039,-0.810812,-0.710416,-0.723097,0.440278,-0.554916,0.667873,1.0,0.168345,0.206023,-0.569607
am,0.599832,-0.522607,-0.591227,-0.243204,0.712711,-0.692495,-0.271763,0.168345,1.0,0.794059,0.057534
gear,0.480285,-0.492687,-0.555569,-0.125704,0.69961,-0.583287,-0.203784,0.206023,0.794059,1.0,0.274073


In [None]:
cars.mpg = cars.mpg.astype(float)   # convert to float  
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 1  32 non-null     object 
 1   mpg         32 non-null     float64
 2   cyl         32 non-null     int64  
 3   disp        32 non-null     float64
 4   hp          32 non-null     int64  
 5   drat        32 non-null     float64
 6   wt          32 non-null     float64
 7   qsec        32 non-null     float64
 8   vs          32 non-null     int64  
 9   am          32 non-null     int64  
 10  gear        32 non-null     int64  
 11  carb        32 non-null     int64  
dtypes: float64(5), int64(6), object(1)
memory usage: 3.1+ KB


---