# Cleaning and preprocessing data

Cleaning and preprocessing data are essential steps in any data analysis or machine learning workflow. Here's a guide on how to handle these tasks effectively in Python.

Import Necessary Libraries

In [1]:
import pandas as pd
import numpy as np

Load your dataset

In [4]:
#Example: Load a CSV file into a pandas DataFrame
df=pd.read_csv('cars.csv')

Inspect the Data

Understand the structure of the dataset and spot potential issues:

In [5]:
#View the first few rows
print(df.head())

    mpg  cylinders  displacement  horsepower  weight  acceleration  \
0  18.0          8         307.0       130.0    3504          12.0   
1  15.0          8         350.0       165.0    3693          11.5   
2  18.0          8         318.0       150.0    3436          11.0   
3  16.0          8         304.0       150.0    3433          12.0   
4  17.0          8         302.0       140.0    3449          10.5   

   model_year origin                       name  
0          70    usa  chevrolet chevelle malibu  
1          70    usa          buick skylark 320  
2          70    usa         plymouth satellite  
3          70    usa              amc rebel sst  
4          70    usa                ford torino  


In [7]:
#Check for missing values
print(df.isnull().sum())

mpg             0
cylinders       0
displacement    0
horsepower      6
weight          0
acceleration    0
model_year      0
origin          0
name            0
dtype: int64
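
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical, not the cars data) that reports both the count and the percentage of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "mpg": [18.0, 15.0, 18.0, 16.0],
    "horsepower": [130.0, np.nan, 150.0, np.nan],
})

# isnull() gives a boolean frame; sum() counts the True values per column,
# mean() gives the fraction of True values per column
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
})
print(missing)
```

A column with only a few percent missing is usually a candidate for imputation, while a mostly empty column may be better dropped.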


In [9]:
#Get data types and summary statistics
print(df.info())
print(df.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model_year    398 non-null    int64  
 7   origin        398 non-null    object 
 8   name          398 non-null    object 
dtypes: float64(4), int64(3), object(2)
memory usage: 28.1+ KB
None
              mpg   cylinders  displacement  horsepower       weight  \
count  398.000000  398.000000    398.000000  392.000000   398.000000   
mean    23.514573    5.454774    193.425879  104.469388  2970.424623   
std      7.815984    1.701004    104.269838   38.491160   846.841774   
min      9.000000    3.000000     68.000000   46.000000  1613.00000

Handle Missing Values

Drop rows/columns:

In [11]:
#Drop rows with missing values
df=df.dropna()

In [12]:
#Alternative: drop columns that contain missing values instead of rows
df=df.dropna(axis=1)
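
Dropping every row or column that has any gap can discard more data than intended. As a rough sketch on a hypothetical three-row frame (`toy` and its values are made up), the `subset` argument of `dropna` restricts the check to chosen columns:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in two different columns
toy = pd.DataFrame({
    "mpg": [18.0, 15.0, np.nan],
    "horsepower": [130.0, np.nan, 150.0],
    "name": ["a", "b", "c"],
})

rows_any = toy.dropna()                           # drop rows with a gap in any column
rows_subset = toy.dropna(subset=["horsepower"])   # only require 'horsepower' to be present
cols_any = toy.dropna(axis=1)                     # drop columns that contain any gap

print(len(rows_any), len(rows_subset), list(cols_any.columns))
```

Here the row-wise drop keeps one row, the subset version keeps two, and the column-wise drop leaves only `name`.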

In [13]:
#Impute missing values:
# Fill missing values with a specific value
# (assign the result back; inplace=True on a selected column is deprecated in recent pandas)
df['mpg']=df['mpg'].fillna(0)

In [14]:
#Fill with the mean/median/mode of the same column
#df['column_name']=df['column_name'].fillna(df['column_name'].mean())
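
To make the mean/median/mode idea concrete, here is a small sketch on made-up data (the `toy` frame and its values are hypothetical). It fills a numeric column with its median and a text column with its most frequent value; which statistic is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one numeric and one categorical column with gaps
toy = pd.DataFrame({
    "horsepower": [130.0, np.nan, 150.0, 140.0, np.nan],
    "origin": ["usa", "japan", None, "usa", "usa"],
})

# Numeric column: the median is less sensitive to extreme values than the mean
hp_median = toy["horsepower"].median()
toy["horsepower"] = toy["horsepower"].fillna(hp_median)

# Categorical column: mode() returns a Series, so take its first entry
origin_mode = toy["origin"].mode()[0]
toy["origin"] = toy["origin"].fillna(origin_mode)

print(toy)
```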

In [16]:
#Handle Duplicates
df=df.drop_duplicates()
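
Before dropping duplicates it can help to count them, and sometimes only certain columns should define what counts as a duplicate. A minimal sketch, assuming a hypothetical `toy` frame with made-up rows:

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 1 are identical, row 3 repeats only the name
toy = pd.DataFrame({
    "name": ["ford torino", "ford torino", "amc rebel sst", "ford torino"],
    "model_year": [70, 70, 70, 71],
})

# Number of rows that exactly repeat an earlier row
n_dupes = toy.duplicated().sum()

# Drop exact duplicates and renumber the index
deduped = toy.drop_duplicates().reset_index(drop=True)

# Treat rows as duplicates when 'name' matches, keeping the last occurrence
by_name = toy.drop_duplicates(subset=["name"], keep="last")

print(n_dupes, len(deduped), by_name.index.tolist())
```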

In [17]:
#Standardize column names
# Lowercase the names and replace spaces with underscores
df.columns=df.columns.str.strip().str.lower().str.replace(' ','_')
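
Column names often contain more than spaces (hyphens, stray whitespace, mixed case). One possible approach, shown here on hypothetical column names, is to collapse every run of non-alphanumeric characters into a single underscore with a regular expression:

```python
import pandas as pd

# Hypothetical messy column names
toy = pd.DataFrame(columns=[" Model Year", "Horse-Power ", "MPG"])

toy.columns = (
    toy.columns
    .str.strip()                                   # remove leading/trailing whitespace
    .str.lower()                                   # lowercase everything
    .str.replace(r"[^0-9a-z]+", "_", regex=True)   # any other run of characters -> "_"
)
print(list(toy.columns))
```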

Convert Data Types
# Convert to datetime
df['date_column'] = pd.to_datetime(df['date_column'])

# Convert to category
df['category_column'] = df['category_column'].astype('category')

# Convert numerical strings to floats
df['numeric_column'] = pd.to_numeric(df['numeric_column'], errors='coerce')
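
The column names above are placeholders. The sketch below runs the same three conversions on a hypothetical frame with deliberately bad entries (all values are made up), to show that `errors='coerce'` turns unparseable values into missing values rather than raising an error.

```python
import pandas as pd

# Hypothetical toy frame where every column arrives as text
toy = pd.DataFrame({
    "date_column": ["2024-01-15", "2024-02-01", "not a date"],
    "category_column": ["usa", "japan", "usa"],
    "numeric_column": ["130", "?", "150.5"],
})

toy["date_column"] = pd.to_datetime(toy["date_column"], errors="coerce")
toy["category_column"] = toy["category_column"].astype("category")
toy["numeric_column"] = pd.to_numeric(toy["numeric_column"], errors="coerce")

print(toy.dtypes)
print(toy.isnull().sum())  # coerced failures show up as missing values
```

Checking the missing-value counts again after conversion is a cheap way to see how many entries failed to parse.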

Handle Outliers
Using z-scores:
from scipy.stats import zscore
df['zscore']=zscore(df['numeric_column'])
df=df[df['zscore'].abs()<3] #Keep rows within 3 standard deviations
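
The same filter can be sketched with pandas alone, which avoids adding a helper column to the frame. This is an illustration on made-up numbers (the `toy` frame is hypothetical); it uses the population standard deviation (`ddof=0`), which matches the default of `scipy.stats.zscore`.

```python
import pandas as pd

# Hypothetical data: 20 ordinary values plus one extreme value
toy = pd.DataFrame({"numeric_column": [95, 98, 100, 102, 105] * 4 + [1000]})

col = toy["numeric_column"]

# z-score: distance from the mean in units of standard deviation
z = (col - col.mean()) / col.std(ddof=0)

# Keep rows within 3 standard deviations of the mean
filtered = toy[z.abs() < 3]

print(len(toy), len(filtered))
```

Note that on very small samples no value can reach a z-score of 3, so this rule only makes sense with a reasonable number of rows.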

Scale and Normalize Data
Standardization (zero mean, unit variance):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df['scaled_column'] = scaler.fit_transform(df[['numeric_column']])


Normalization (range 0-1):
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df['normalized_column'] = scaler.fit_transform(df[['numeric_column']])
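
To show what the two scalers compute, here is a sketch of the underlying formulas written directly in pandas on a hypothetical column. It is an illustration of the arithmetic, not a replacement for scikit-learn; `StandardScaler` divides by the population standard deviation, hence `ddof=0`.

```python
import pandas as pd

# Hypothetical numeric column
toy = pd.DataFrame({"numeric_column": [10.0, 20.0, 30.0, 40.0, 50.0]})
col = toy["numeric_column"]

# Standardization: (x - mean) / std, giving zero mean and unit variance
toy["scaled_column"] = (col - col.mean()) / col.std(ddof=0)

# Min-max normalization: (x - min) / (max - min), giving values in [0, 1]
toy["normalized_column"] = (col - col.min()) / (col.max() - col.min())

print(toy)
```

Standardization is sensitive to the spread of the data, while min-max scaling is determined entirely by the two extreme values, so outliers are best handled before either step.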

    

In [20]:
#Save Clean data
df.to_csv('cleaned_dataset.csv',index=False)
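
A quick way to confirm the file was written as expected is to read it back and compare. The sketch below does this with a hypothetical two-row frame and a temporary directory, so it leaves no files behind; the file name is reused only for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned frame
toy = pd.DataFrame({
    "mpg": [18.0, 15.0],
    "name": ["buick skylark 320", "ford torino"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cleaned_dataset.csv")
    toy.to_csv(path, index=False)   # index=False avoids writing an extra index column
    reloaded = pd.read_csv(path)

# The reloaded frame should have the same shape and column names
same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)
```

CSV does not store data types, so columns converted to `category` or datetime will need to be converted again after reloading.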