<a href="https://colab.research.google.com/github/Pavankumar2124/Class/blob/main/module2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Module 2: Data Transformation

Name: K.Pavankumar  
Registration Number: 22BCE2124

# Overview
This module covers core data transformation techniques: deduplication, mode imputation of missing categorical values, regression-based imputation via Maximum Likelihood Estimation (MLE), discretization, and outlier detection with the Interquartile Range (IQR).

# Step 1: Load the Dataset
Read 'shootings.csv' into a pandas DataFrame and preview the first five rows.

In [None]:
# Importing the necessary libraries
import pandas as pd

# Load the dataset
data = pd.read_csv('shootings.csv')

data.head()

Unnamed: 0,id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera,arms_category
0,3,Tim Elliot,02/01/15,shot,gun,53.0,M,Asian,Shelton,WA,True,attack,Not fleeing,False,Guns
1,4,Lewis Lee Lembke,02/01/15,shot,gun,47.0,M,White,Aloha,OR,False,attack,Not fleeing,False,Guns
2,5,John Paul Quintero,03/01/15,shot and Tasered,unarmed,23.0,M,Hispanic,Wichita,KS,False,other,Not fleeing,False,Unarmed
3,8,Matthew Hoffman,04/01/15,shot,toy weapon,32.0,M,White,San Francisco,CA,True,attack,Not fleeing,False,Other unusual objects
4,9,Michael Rodriguez,04/01/15,shot,nail gun,39.0,M,Hispanic,Evans,CO,False,attack,Not fleeing,False,Piercing objects


# Step 2: Data Deduplication
Check and remove duplicate entries to ensure data quality.

In [None]:
# Check for duplicate rows
duplicates = data.duplicated()
print(f"Number of duplicate rows: {duplicates.sum()}")

# Remove duplicates if present
data = data.drop_duplicates()
print(f"Dataset shape after removing duplicates: {data.shape}")

Number of duplicate rows: 0
Dataset shape after removing duplicates: (4895, 15)
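As a small illustration of what the check above does, the sketch below runs `duplicated()` and `drop_duplicates()` on a toy DataFrame. The toy rows and the choice of 'name' and 'date' as a `subset` key are assumptions for illustration; they are not taken from shootings.csv.

```python
import pandas as pd

# Toy data (hypothetical): row 2 repeats row 0 exactly; row 3 repeats only the name and date of row 1
toy = pd.DataFrame({
    "name": ["A", "B", "A", "B"],
    "date": ["01/01/15", "02/01/15", "01/01/15", "02/01/15"],
    "age": [30, 41, 30, 42],
})

# duplicated() flags rows that exactly repeat an earlier row
exact_dupes = toy.duplicated()
print(f"Exact duplicates: {exact_dupes.sum()}")

# drop_duplicates() keeps the first occurrence of each exact duplicate
deduped = toy.drop_duplicates()
print(f"Shape after exact deduplication: {deduped.shape}")

# With subset=..., only the listed columns decide whether a row is a duplicate
deduped_by_key = toy.drop_duplicates(subset=["name", "date"])
print(f"Shape after key-based deduplication: {deduped_by_key.shape}")
```

Exact matching is the conservative choice; a `subset` key removes more rows and is only appropriate when the key columns are known to identify a record.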


# Step 3: Handling Missing Data
Count the missing values in each column, then fill the categorical columns 'race', 'manner_of_death', 'armed', and 'arms_category' with their most frequent value (the mode).

In [None]:
missing_values = data.isnull().sum()
print("Missing values in each column:")
print(missing_values)

# Fill missing values in categorical columns with each column's mode.
# Assign the result back; data[col].fillna(..., inplace=True) triggers a pandas FutureWarning.
print(data['race'].mode())
data['race'] = data['race'].fillna('White')

print(data['manner_of_death'].mode())
data['manner_of_death'] = data['manner_of_death'].fillna('shot')

print(data['armed'].mode())
data['armed'] = data['armed'].fillna('gun')

print(data['arms_category'].mode())
data['arms_category'] = data['arms_category'].fillna('Guns')

# Verify that there are no more missing values
print("Missing values after handling:")
print(data.isnull().sum())

# index=False avoids writing the row index, which would reload as an 'Unnamed: 0' column
data.to_csv('inter.csv', index=False)


Missing values in each column:
id                          0
name                        0
date                        0
manner_of_death             0
armed                       0
age                         0
gender                      0
race                        0
city                        0
state                       0
signs_of_mental_illness     0
threat_level                0
flee                        0
body_camera                 0
arms_category               0
age_binned                 10
dtype: int64
0    White
Name: race, dtype: object
0    shot
Name: manner_of_death, dtype: object
0    gun
Name: armed, dtype: object
0    Guns
Name: arms_category, dtype: object
Missing values after handling:
id                          0
name                        0
date                        0
manner_of_death             0
armed                       0
age                         0
gender                      0
race                        0
city                        0
state     

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  data['race'].fillna('White',inplace=True)
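
The warning above names two replacements for `df[col].fillna(value, inplace=True)`. A minimal sketch of both on a toy DataFrame follows. The toy values are hypothetical, and computing the fill value with `mode()[0]` instead of hard-coding it is a suggestion, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) with one missing value per column
toy = pd.DataFrame({
    "race": ["White", "Black", "White", np.nan],
    "armed": ["gun", np.nan, "knife", "gun"],
})

# Pattern 1: assign the filled column back to the DataFrame
race_mode = toy["race"].mode()[0]  # most frequent non-missing value
toy["race"] = toy["race"].fillna(race_mode)

# Pattern 2: call fillna on the DataFrame with a {column: value} mapping
armed_mode = toy["armed"].mode()[0]
toy = toy.fillna({"armed": armed_mode})

print(f"Remaining missing values: {toy.isnull().sum().sum()}")
```

Both patterns modify the original column rather than a temporary copy, which is the problem the warning describes.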

# Step 4: Handling Missing Data with Maximum Likelihood Estimation (MLE)
Estimate missing values in the 'age' column from other features. The linear regression used here is fit by least squares, which coincides with the maximum likelihood estimate when the noise is assumed to be Gaussian.

In [None]:
from sklearn.linear_model import LinearRegression
import numpy as np

data_mle = data.copy()
data_mle.loc[0:5, 'age'] = np.nan  # Introduce missing values for demonstration

# Assumption: shootings.csv has no 'duration', 'campaign', or 'emp.var.rate' columns,
# so 'age' is the target and two boolean columns serve as illustrative features.
features = ['signs_of_mental_illness', 'body_camera']
X = data_mle[features].astype(int)
missing = data_mle['age'].isnull()

# Train on the rows where 'age' is present
mle_model = LinearRegression()
mle_model.fit(X[~missing], data_mle.loc[~missing, 'age'])

# Replace the missing 'age' values with the model's predictions
data_mle.loc[missing, 'age'] = mle_model.predict(X[missing])

# Display the updated dataset with estimated 'age' values
print("Dataset after MLE handling for 'age':")
print(data_mle.head())
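
To make the link between linear regression and MLE concrete, the sketch below checks on synthetic data that `LinearRegression` returns the same coefficients as an explicit least-squares solve, which is the maximum likelihood estimate under the assumption of independent Gaussian noise. The synthetic data, the coefficient values, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): y = 2*x1 - 1*x2 + 5 + Gaussian noise
X = rng.normal(size=(200, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 5.0 + rng.normal(scale=0.5, size=200)

# Hide the first six targets to mimic missing values
y_obs = y.copy()
y_obs[:6] = np.nan
missing = np.isnan(y_obs)

# Least-squares fit on the complete rows
model = LinearRegression().fit(X[~missing], y_obs[~missing])

# Same estimate from an explicit least-squares solve with an intercept column;
# under Gaussian noise this is also the maximum likelihood estimate of the coefficients
X_design = np.column_stack([np.ones((~missing).sum()), X[~missing]])
beta, *_ = np.linalg.lstsq(X_design, y_obs[~missing], rcond=None)

# The Gaussian MLE of the noise variance is the mean squared residual (divides by n)
residuals = y_obs[~missing] - model.predict(X[~missing])
sigma2_mle = np.mean(residuals ** 2)

# Fill the missing targets with the fitted conditional mean
y_filled = y_obs.copy()
y_filled[missing] = model.predict(X[missing])

print("Intercept and coefficients:", beta)
print("Noise variance (MLE):", sigma2_mle)
```

Imputing with the conditional mean understates the variability of the filled values, so this approach suits a demonstration better than an analysis that depends on the spread of the imputed column.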

# Step 5: Data Discretization
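Discretize the continuous 'age' column into four categorical bins with pd.cut. The intervals are right-closed, so '0-20' covers ages in (0, 20]. Ages outside (0, 80] receive NaN, which is consistent with the 10 missing 'age_binned' values listed in the Step 3 output.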

In [None]:
bins = [0, 20, 40, 60, 80]
labels = ['0-20', '21-40', '41-60', '61-80']
data['age_binned'] = pd.cut(data['age'], bins=bins, labels=labels)

# Display the updated dataframe with binned categories
print("Dataset with 'age' discretized into bins:")
print(data[['age', 'age_binned']].head())

Dataset with 'age' discretized into bins:
    age age_binned
0  53.0      41-60
1  47.0      41-60
2  23.0      21-40
3  32.0      21-40
4  39.0      21-40
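
The edge behavior of `pd.cut` is easy to overlook, so the sketch below shows it on a few hand-picked ages. The `include_lowest=True` option and the open-ended '81+' bin are a hypothetical variant, not part of the notebook's binning.

```python
import numpy as np
import pandas as pd

ages = pd.Series([0.0, 20.0, 21.0, 80.0, 85.0])

# Same bins as above: intervals are (0, 20], (20, 40], (40, 60], (60, 80]
bins = [0, 20, 40, 60, 80]
labels = ['0-20', '21-40', '41-60', '61-80']
binned = pd.cut(ages, bins=bins, labels=labels)

# 0.0 and 85.0 fall outside every interval and become NaN
print(f"Unbinned values: {binned.isnull().sum()}")

# Hypothetical variant: include the lowest edge and add an open-ended top bin
bins_open = [0, 20, 40, 60, 80, np.inf]
labels_open = labels + ['81+']
binned_open = pd.cut(ages, bins=bins_open, labels=labels_open, include_lowest=True)
print(binned_open.tolist())
```

Whether an open-ended top bin is appropriate depends on the analysis; it avoids silently introducing missing values, at the cost of a bin with a different width.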


# Step 6: Outlier Detection
Detect outliers in the numerical 'age' column using the Interquartile Range (IQR) method: values more than 1.5 × IQR below the first quartile or above the third quartile are flagged.

In [None]:
# Assumption: shootings.csv has no 'duration' column, so 'age' is analyzed instead
Q1 = data['age'].quantile(0.25)
Q3 = data['age'].quantile(0.75)
IQR = Q3 - Q1

# Defining outlier thresholds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Detecting outliers
outliers = data[(data['age'] < lower_bound) | (data['age'] > upper_bound)]
print(f"Number of outliers detected in 'age': {len(outliers)}")

# Optionally remove outliers (uncomment the line below to remove them)
# data = data[(data['age'] >= lower_bound) & (data['age'] <= upper_bound)]
print(f"Dataset shape after outlier handling: {data.shape}")
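
The IQR rule can be wrapped in a small helper so the same thresholds apply to any numeric column. The sketch below is one way to do that; the function name `iqr_bounds`, its `k` parameter, and the toy values are hypothetical.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy data (hypothetical): 100 lies far from the other values
toy = pd.Series([10, 12, 13, 14, 15, 16, 18, 100])
lower, upper = iqr_bounds(toy)

# Flag values outside the fences
outliers = toy[(toy < lower) | (toy > upper)]
print(f"Fences: ({lower}, {upper})")
print(f"Outliers: {outliers.tolist()}")
```

The multiplier 1.5 is the conventional default; a larger `k` flags only more extreme values.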


# Conclusion
This notebook demonstrates deduplication, mode imputation, regression-based (MLE) imputation, discretization, and IQR-based outlier detection on the shootings dataset. The data with missing values filled is saved to 'inter.csv' for subsequent analysis.