
##Handling Missing Values
Handling missing values is necessary for several reasons:

1. **Preventing Biased Analysis**: Missing values can introduce bias into statistical analyses and machine learning models. If not properly handled, missing values can skew results and lead to incorrect conclusions.

2. **Maintaining Data Integrity**: Missing values can compromise the integrity of the dataset, making it less reliable for analysis or modeling purposes. By handling missing values appropriately, we ensure that the dataset remains accurate and trustworthy.

3. **Improving Model Performance**: Many machine learning algorithms cannot process missing values directly. By imputing or removing missing values, we make the data usable for these algorithms and help them learn effectively from the values that are available.

4. **Enhancing Data Quality**: High-quality data is essential for making informed decisions. Handling missing values is an important step in data cleaning and preparation, which ultimately leads to higher data quality and more reliable insights.

5. **Compliance with Analysis Requirements**: In certain industries or applications, such as healthcare or finance, there may be strict regulations or standards regarding missing data handling. Ensuring compliance with these requirements is crucial for legal and ethical reasons.

##Using Pandas

Pandas is a Python library widely used for data manipulation and analysis. It provides data structures and functions for detecting and handling missing values in datasets. One of the most commonly used is the fillna() method, which imputes (fills) missing values with values you specify.
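
As a minimal sketch of how fillna() behaves, the toy DataFrame below (hypothetical values, not taken from the Titanic data) is filled once with a constant and once with a different replacement per column. The column names `age` and `city` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per column (hypothetical values)
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0],
    "city": ["Paris", None, "Oslo"],
})

# Fill a single column with a constant
age_zero_filled = toy["age"].fillna(0)

# Fill each column with its own replacement by passing a dict
filled_per_column = toy.fillna({"age": toy["age"].mean(), "city": "Unknown"})

print(filled_per_column)
```

Note that fillna() returns a new object by default, so `toy` itself still contains its missing values after these calls.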




##Loading the Dataset

We use the Titanic dataset, well known from Kaggle, to illustrate the handling of missing values. Here it is loaded from a public copy hosted on GitHub.

In [1]:
import pandas as pd

url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)
df.head()


,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


##Exploring the Dataset

In [2]:
# Checking the dimensions of the dataset
print("Dataset dimensions:", df.shape)

# Checking the data types of each column
print("\nData types:")
print(df.dtypes)

# Summary statistics
print("\nSummary statistics:")
print(df.describe())

# Checking for missing values
print("\nMissing values:")
print(df.isnull().sum())

Dataset dimensions: (891, 12)

Data types:
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

Summary statistics:
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Par

##Identifying Missing Values

In [3]:
# Display columns with missing values
print("\nColumns with missing values:")
print(df.columns[df.isnull().any()])

# Percentage of missing values in each column
print("\nPercentage of missing values in each column:")
print((df.isnull().sum() / len(df)) * 100)



Columns with missing values:
Index(['Age', 'Cabin', 'Embarked'], dtype='object')

Percentage of missing values in each column:
PassengerId     0.000000
Survived        0.000000
Pclass          0.000000
Name            0.000000
Sex             0.000000
Age            19.865320
SibSp           0.000000
Parch           0.000000
Ticket          0.000000
Fare            0.000000
Cabin          77.104377
Embarked        0.224467
dtype: float64
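
When a dataset has many columns, it can help to keep only the affected columns and sort them by how much is missing. The sketch below does this on a small stand-in frame with hypothetical values; on the real data the same expressions could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame (hypothetical values)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

missing_count = toy.isnull().sum()
# The mean of a boolean mask is the fraction of True values
missing_pct = toy.isnull().mean() * 100

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
# Keep only columns that have gaps, worst first
summary = summary[summary["missing"] > 0].sort_values("percent", ascending=False)

print(summary)
```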


##Handling Missing Values

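Filling missing values in numeric columns with the column mean. Text columns such as Cabin and Embarked have no mean, so this step leaves them unchanged.

In [4]:
# Imputing missing numeric values with each column's mean.
# numeric_only=True is needed on recent pandas versions (2.0+), where
# df.mean() raises a TypeError if the frame contains non-numeric columns.
df_imputed = df.fillna(df.mean(numeric_only=True))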

# Checking if missing values are filled
print("\nMissing values after imputation:")
print(df_imputed.isnull().sum())



Missing values after imputation:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
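
The output shows that Cabin and Embarked still contain missing values, because a mean exists only for numeric columns. One possible way to treat such columns is sketched below on a toy frame with hypothetical values: the most frequent value fills a categorical column that has few gaps, and a mostly empty column is reduced to a present/absent flag. The `HasCabin` column name and the use of the median for Age are illustrative choices, not something the dataset prescribes.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame (hypothetical values)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
    "Cabin": [None, "C85", None, None],
})

cleaned = toy.copy()

# Numeric column: the median is less sensitive to outliers than the mean
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())

# Categorical column with few gaps: fill with the most frequent value
most_common_port = cleaned["Embarked"].mode()[0]
cleaned["Embarked"] = cleaned["Embarked"].fillna(most_common_port)

# Mostly empty column: keep a flag for "value recorded", drop the raw column
cleaned["HasCabin"] = cleaned["Cabin"].notnull()
cleaned = cleaned.drop(columns="Cabin")

print(cleaned.isnull().sum())
```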




In [5]:
# Comparing summary statistics before and after handling missing values
print("\nSummary statistics before handling missing values:")
print(df.describe())

print("\nSummary statistics after handling missing values:")
print(df_imputed.describe())


Summary statistics before handling missing values:
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  

Summary statistics after han
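
One effect worth looking for in a before/after comparison like this is that mean imputation keeps the column mean unchanged but reduces its spread, because every filled value sits exactly at the mean. The sketch below shows this on a small hypothetical series rather than on the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries
ages = pd.Series([20.0, 30.0, 40.0, np.nan, np.nan], name="Age")
imputed = ages.fillna(ages.mean())

# count() ignores missing values, so it grows after imputation
print("count before:", ages.count(), "after:", imputed.count())
print("mean  before:", ages.mean(), "after:", imputed.mean())
print("std   before:", ages.std(), "after:", imputed.std())
```

The standard deviation drops after imputation while the mean stays the same, which is why mean-imputed columns can understate the variability of the original data.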

##Handling Duplicate Values

Handling duplicate values is essential for several reasons:

1. **Maintaining Data Integrity**: Duplicate values can compromise the integrity of a dataset, leading to inaccurate analyses and modeling results. By identifying and removing duplicates, we ensure that each record is counted only once.

2. **Preventing Biased Analysis**: Duplicate values can skew statistical analyses and machine learning models, leading to biased conclusions. Removing duplicates helps prevent overrepresentation of certain data points, ensuring fair and unbiased analyses.

3. **Improving Data Quality**: High-quality data is crucial for making informed decisions. Handling duplicate values enhances data quality by eliminating redundancies and ensuring that each observation contributes uniquely to the analysis.

4. **Enhancing Model Performance**: Duplicate rows give repeated records extra weight during training, and a duplicate that lands in both the training and the test split can inflate evaluation scores. Removing duplicates avoids both problems and lets models learn from diverse and representative data.

5. **Compliance with Analysis Requirements**: In certain industries or applications, such as finance or healthcare, there may be strict regulations or standards regarding data quality and integrity. Handling duplicate values ensures compliance with these requirements, mitigating legal and ethical risks.



##Load the Dataset

We'll work with the "Adult Income" dataset, which contains information about individuals' demographic and income attributes. This dataset is often used for classification tasks to predict whether an individual earns more than $50K annually.

In [10]:
import pandas as pd

# Load the Adult Income dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
column_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']
# skipinitialspace=True strips the space that follows each comma in this file
df = pd.read_csv(url, header=None, names=column_names, skipinitialspace=True)

# Display the first few rows of the dataset
df.head()


,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


##Check for duplicate rows in the Dataset

In [14]:
# Checking for duplicate rows in the dataset
print("Number of duplicate rows:", df.duplicated().sum())

print("Dataset dimensions before removing duplicate rows:", df.shape)


Number of duplicate rows: 24
Dataset dimensions before removing duplicate rows: (32561, 15)
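
Before deleting anything, it is often useful to look at the duplicated rows themselves. By default, duplicated() flags only the second and later occurrences of a row. Passing `keep=False` flags every member of a duplicate group. The sketch below uses a toy frame with hypothetical values in place of the Adult Income data.

```python
import pandas as pd

# Toy data: rows 0 and 2 are identical, and so are rows 1 and 4
toy = pd.DataFrame({
    "age": [39, 50, 39, 28, 50],
    "workclass": ["State-gov", "Private", "State-gov", "Private", "Private"],
})

# Default: only repeats after the first occurrence are counted
n_repeats = int(toy.duplicated().sum())

# keep=False: show all rows that belong to a duplicate group, side by side
dupes = toy[toy.duplicated(keep=False)].sort_values(["age", "workclass"])

print("Repeated rows:", n_repeats)
print(dupes)
```

In a dataset of demographic attributes, two identical rows may describe two different people, so inspecting the groups first helps decide whether removal is appropriate.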


##Removing Duplicate Rows

In [12]:
# Removing duplicate rows from the dataset
df_cleaned = df.drop_duplicates()

print("Dataset dimensions after removing duplicate rows:", df_cleaned.shape)


Dataset dimensions after removing duplicate rows: (32537, 15)
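
drop_duplicates() compares entire rows by default. Its `subset` and `keep` parameters control which columns define a duplicate and which occurrence survives. The sketch below assumes a hypothetical `person_id` key and `updated` column, neither of which exists in the Adult Income data, to show how only the latest record per key could be kept.

```python
import pandas as pd

# Toy data: person_id 2 appears twice with different "updated" values
toy = pd.DataFrame({
    "person_id": [1, 2, 2, 3],
    "income": ["<=50K", ">50K", ">50K", "<=50K"],
    "updated": ["2020", "2020", "2021", "2020"],
})

# Whole-row comparison: nothing is removed, since no two rows match exactly
exact = toy.drop_duplicates()

# Key-based comparison: keep the last row per person_id and renumber the index
latest = toy.drop_duplicates(subset="person_id", keep="last").reset_index(drop=True)

print(len(exact), len(latest))
print(latest)
```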


##Evaluating the Impact

In [15]:
# Comparing summary statistics before and after handling duplicate values
print("\nSummary statistics before handling duplicate values:")
print(df.describe())

print("\nSummary statistics after handling duplicate values:")
print(df_cleaned.describe())


Summary statistics before handling duplicate values:
                age        fnlwgt  education_num  capital_gain  capital_loss  \
count  32561.000000  3.256100e+04   32561.000000  32561.000000  32561.000000   
mean      38.581647  1.897784e+05      10.080679   1077.648844     87.303830   
std       13.640433  1.055500e+05       2.572720   7385.292085    402.960219   
min       17.000000  1.228500e+04       1.000000      0.000000      0.000000   
25%       28.000000  1.178270e+05       9.000000      0.000000      0.000000   
50%       37.000000  1.783560e+05      10.000000      0.000000      0.000000   
75%       48.000000  2.370510e+05      12.000000      0.000000      0.000000   
max       90.000000  1.484705e+06      16.000000  99999.000000   4356.000000   

       hours_per_week  
count    32561.000000  
mean        40.437456  
std         12.347429  
min          1.000000  
25%         40.000000  
50%         40.000000  
75%         45.000000  
max         99.000000  

Summary 

##Conclusion

Handling missing values and duplicate values is a crucial step in the data preprocessing pipeline, ensuring the integrity and reliability of the dataset for analysis and modeling purposes.

**Handling Missing Values:**
Dealing with missing values involves identifying and addressing instances where data entries are absent or incomplete. The two main techniques are imputation (replacing missing values with a suitable estimate, such as the column mean used above) and removal (dropping rows or columns that contain missing values). Handled appropriately, missing values introduce less bias, data quality is maintained, and machine learning models perform better.

**Handling Duplicate Values:**
Detecting and removing duplicate values involves identifying instances where data entries are identical or redundant. Duplicate values can introduce bias, skew statistical analyses, and compromise the integrity of modeling efforts. By removing duplicate values, analysts and data scientists can ensure that each observation is unique and representative, thus enhancing the reliability and accuracy of subsequent analyses and models.

Together, these two preprocessing tasks improve the overall quality and trustworthiness of the dataset, so that the insights and models built on it can better support informed decision-making.