# <font color='green'> Data Preparation:</font>



-  Data preparation is the critical process of transforming raw data into a clean, consistent, and analysis-ready format.


**Topics under Data Preparation**
 
- Handling Missing Values.


- Handling Duplicates.


- Data Formatting.


- Data Normalization.



### <font color='blue'>1.Handling Missing Values:</font>


- Missing values are usually represented in the dataset as **NaN**, **null**, **None**, or **NA**.


- Depending on the requirement, we can remove them or replace them with the mean, the median, or another value.





In [1]:
import pandas as pd
import numpy as np

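# Create data that marks missing entries with placeholder strings instead of np.nan
# (named `data` rather than `dict` so the built-in dict type is not shadowed)
data = {'First Score': [1, 2, " ", 3],
        'Second Score': [1, 2, 3, "NAN "],
        'Third Score': ["NA", 1, 2, 3]}

df = pd.DataFrame(data)

# Replace all the placeholder strings with np.nan in one call
df = df.replace({" ": np.nan, "NAN ": np.nan, "NA": np.nan})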

a = df.isna()
print(a)

# Count the missing values in each column
b = df.isna().sum()
print(b)

# Select the rows that contain at least one missing value
c = df[df.isna().any(axis=1)]
print(c) 

   First Score  Second Score  Third Score
0        False         False         True
1        False         False        False
2         True         False        False
3        False          True        False
First Score     1
Second Score    1
Third Score     1
dtype: int64
   First Score  Second Score  Third Score
0          1.0           1.0          NaN
2          NaN           3.0          2.0
3          3.0           NaN          3.0
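
Beyond raw counts, the share of missing values per column and the number of gaps per row can help when deciding whether to drop or fill. The following is a minimal sketch on a small frame like the one above; the names `missing_share` and `missing_per_row` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'First Score': [1, 2, np.nan, 3],
                   'Second Score': [1, 2, 3, np.nan],
                   'Third Score': [np.nan, 1, 2, 3]})

# isna() returns booleans, so the column mean is the fraction of missing values
missing_share = df.isna().mean()
print(missing_share)

# Summing across columns (axis=1) counts the missing values in each row
missing_per_row = df.isna().sum(axis=1)
print(missing_per_row)
```

A column with a high missing share may be a candidate for removal, while a low share usually makes filling more reasonable; the threshold is a judgment call for each dataset.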


**Drop Rows With Missing Values**

In [2]:
data = {'First Score': [1, 2, np.nan, 3],
        'Second Score': [1, 2, 3, np.nan],
        'Third Score': [np.nan, 1, 2, 3]}
df = pd.DataFrame(data)

# Drop every row that contains at least one missing value
a = df.dropna()
print(a)

   First Score  Second Score  Third Score
1          2.0           2.0          1.0
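
Dropping every incomplete row removed three of the four rows here. As a hedged sketch of gentler options, `dropna()` also accepts `subset`, `thresh`, and `axis`; the result names below are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'First Score': [1, 2, np.nan, 3],
                   'Second Score': [1, 2, 3, np.nan],
                   'Third Score': [np.nan, 1, 2, 3]})

# Only look at 'First Score' when deciding which rows to drop
by_subset = df.dropna(subset=['First Score'])
print(by_subset)

# Keep rows that have at least 2 non-missing values
by_thresh = df.dropna(thresh=2)
print(by_thresh)

# Drop columns (axis=1) that contain any missing value
by_column = df.dropna(axis=1)
print(by_column.shape)
```

In this frame every column has one gap, so dropping by column leaves no columns at all, which shows why the choice of axis matters.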


**Fill Missing Values**

In [3]:
# fill the missing values in each column with 0

data = {
    'A': [1, 2, 3, None, 5],  
    'B': [None, 2, 3, 4, 5],  
    'C': [1, 2, None, None, 5]
}

df=pd.DataFrame(data)
df.fillna(0,inplace=True)
print(df)


# fill missing values with the mean of each column.
data1= {
    'A': [1, 2, 3, None, 5],  
    'B': [None, 2, 3, 4, 5],  
    'C': [1, 2, None, None, 5] }

df1=pd.DataFrame(data1)
df1.fillna(df1.mean(), inplace=True)
print(df1)

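# replace missing values with the mean for a single column
data2= {
    'A': [1, 2, 3, None, 5],  
    'B': [None, 2, 3, 4, 5],  
    'C': [1, 2, None, None, 5] }

df2=pd.DataFrame(data2)
# Assign the result back: calling fillna(..., inplace=True) on a selected column
# is deprecated in recent pandas and may not modify df2
df2["A"] = df2["A"].fillna(df2["A"].mean())
print(df2)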


     A    B    C
0  1.0  0.0  1.0
1  2.0  2.0  2.0
2  3.0  3.0  0.0
3  0.0  4.0  0.0
4  5.0  5.0  5.0
      A    B         C
0  1.00  3.5  1.000000
1  2.00  2.0  2.000000
2  3.00  3.0  2.666667
3  2.75  4.0  2.666667
4  5.00  5.0  5.000000
      A    B    C
0  1.00  NaN  1.0
1  2.00  2.0  2.0
2  3.00  3.0  NaN
3  2.75  4.0  NaN
4  5.00  5.0  5.0
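
The mean is only one choice of fill value. As a sketch of two alternatives, the median is less affected by outliers, and a forward fill copies the last seen value downward, which can suit ordered data. The names `median_filled` and `forward_filled` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, None, 5],
                   'B': [None, 2, 3, 4, 5],
                   'C': [1, 2, None, None, 5]})

# Fill each column with its own median
median_filled = df.fillna(df.median())
print(median_filled)

# Forward fill: propagate the previous row's value into each gap
forward_filled = df.ffill()
print(forward_filled)
```

Note that a forward fill cannot fill a gap in the first row, because there is no earlier value to copy, so column `B` keeps its leading NaN.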


### <font color='blue'>2.Handling Duplicates:</font>

- **duplicated()** - to check for duplicates


- **drop_duplicates()** - remove duplicate rows


In [4]:
data = {
    'A': [1,  2, 2, 3, 3, 4],
    'B': [5, 6, 6, 7, 8, 8] }

df = pd.DataFrame(data)

# detect duplicates
a=df.duplicated()
print(a)

# remove rows that are exact duplicates across all columns
b=df.drop_duplicates()
print(b)



0    False
1    False
2     True
3    False
4    False
5    False
dtype: bool
   A  B
0  1  5
1  2  6
3  3  7
4  3  8
5  4  8
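
By default both methods compare entire rows. As a sketch of how to restrict the comparison to one column, the `subset` and `keep` parameters can be used; the result names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 3, 4],
                   'B': [5, 6, 6, 7, 8, 8]})

# Treat rows as duplicates when column 'A' repeats; keep the first occurrence
first_per_a = df.drop_duplicates(subset=['A'])
print(first_per_a)

# Same comparison, but keep the last occurrence instead
last_per_a = df.drop_duplicates(subset=['A'], keep='last')
print(last_per_a)

# keep=False flags every member of a duplicate group, not only the repeats
all_dupes = df.duplicated(subset=['A'], keep=False)
print(all_dupes)
```

Comparing on `A` alone also removes row 4, which differs from row 3 only in column `B`, so the subset should be chosen to match what counts as the same record.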


### <font color='blue'>3.Data Formatting:</font>


- Data formatting is the process of converting data from one format to another.


- **1.String Formatting**  
- **2.Numeric Formatting**
- **3.Data Type Conversion**

In [5]:
#  String Formatting

name="sachin"
age=25
string = f"My name is {name} and I am {age} years old"
print(string)


#  Numeric Formatting

price = 1234.56789
# Format number to two decimal places
formatted_price = f"${price:.2f}"
print(formatted_price) 


# Data Type Conversion.

num_str = "100"
num_int = int(num_str)    # Convert string to integer
print(num_int)
 
num_float = float(num_str)  # Convert string to float
print(num_float)  

My name is sachin and I am 25 years old
$1234.57
100
100.0
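
A few more format specifiers and a conversion pitfall, shown as a small sketch; the variable names are illustrative.

```python
price = 1234.56789
ratio = 0.256
order_id = 42

# Thousands separator with two decimal places
with_commas = f"{price:,.2f}"
print(with_commas)   # 1,234.57

# Percentage with one decimal place
as_percent = f"{ratio:.1%}"
print(as_percent)    # 25.6%

# Zero-padded integer, five characters wide
padded = f"{order_id:05d}"
print(padded)        # 00042

# int() cannot parse a decimal string directly; go through float() first
try:
    int("12.5")
    direct_failed = False
except ValueError:
    direct_failed = True

via_float = int(float("12.5"))   # truncates toward zero
print(direct_failed, via_float)  # True 12
```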


In real-world scenarios, data is often collected from various sources, which causes inconsistencies in its format. For example, a single column can contain both **numeric and string** values when the data is copied from different sources.

In [6]:
import pandas as pd


data = {
    'Country': ['USA', 'Canada', 'Australia', 'Germany', 'Japan'],
    'Date': ['2023-07-20', '2023-07-21', '2023-07-22', '2023-07-23', '2023-07-24'],
    'Temperature': [25.5, '28.0', 30.2, 22.8, 26.3]
}
df = pd.DataFrame(data)


# The Temperature column is in an inconsistent format: it mixes float and string values.
# Convert every value in the column to float using astype()
df['Temperature'] = df['Temperature'].astype(float)

# calculate the mean temperature
mean_temperature = df['Temperature'].mean()

print(mean_temperature)

26.560000000000002
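
`astype(float)` works above because every string happens to be a valid number. The sketch below assumes a hypothetical column that also contains an unparseable entry (`'n/a'`), shows how to inspect the mix of types, and uses `pd.to_numeric(errors='coerce')` to turn bad entries into NaN instead of failing.

```python
import pandas as pd

# Hypothetical data: one numeric string and one entry that is not a number
temps = pd.Series([25.5, '28.0', 30.2, 'n/a', 26.3], name='Temperature')

# Count how many values of each Python type the column holds
type_counts = temps.map(lambda v: type(v).__name__).value_counts()
print(type_counts)

# astype(float) raises ValueError as soon as it meets 'n/a'
try:
    temps.astype(float)
    astype_failed = False
except ValueError:
    astype_failed = True
print(astype_failed)

# errors='coerce' converts what it can and marks the rest as NaN
clean = pd.to_numeric(temps, errors='coerce')
print(clean)
print(clean.mean())  # NaN is skipped by default
```

The coerced NaN values can then be handled with the missing-value techniques from section 1.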


**DateTime**

- We can convert any valid date string to a datetime using **to_datetime()**.

In [7]:
import pandas as pd

# create a dataframe with date strings
df = pd.DataFrame({'date': ['2021-01-13', '2022-10-22', '2023-12-03']})
df["date"] = pd.to_datetime(df["date"])
print(df)

        date
0 2021-01-13
1 2022-10-22
2 2023-12-03
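
Dates do not always arrive in ISO format. As a sketch, assuming hypothetical day/month/year strings with one invalid entry, the `format` and `errors` parameters make the parsing explicit, and the `.dt` accessor exposes the date parts afterwards.

```python
import pandas as pd

# Hypothetical input: day/month/year strings plus one invalid value
raw = pd.Series(['13/01/2021', '22/10/2022', 'not a date'])

# An explicit format avoids day/month ambiguity; errors='coerce' yields NaT for bad input
parsed = pd.to_datetime(raw, format='%d/%m/%Y', errors='coerce')
print(parsed)

# Extract components from the parsed dates
print(parsed.dt.year)
print(parsed.dt.month)
```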


### <font color='blue'>4.Data Normalization:</font>


- Data normalization transforms numerical features to a common scale, which can improve the accuracy of models that are sensitive to feature magnitude.


- Normalizing data is a technique used to rescale the data so that it falls within a similar scale or range.

**1.Min-Max Normalization.**

- A **MinMaxScaler** from scikit-learn is created. With its default settings, this scaler rescales each feature to the range **0 to 1** using (X - min) / (max - min).

In [8]:
import pandas as pd

data = {
    'Feature1': [10, 20, 30, 40, 50],
    'Feature2': [100, 200, 300, 400, 500]
}
df = pd.DataFrame(data)



from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler()

df[['Feature1', 'Feature2']] = min_max_scaler.fit_transform(df[['Feature1', 'Feature2']])
df[['Feature1', 'Feature2']] 

   Feature1  Feature2
0      0.00      0.00
1      0.25      0.25
2      0.50      0.50
3      0.75      0.75
4      1.00      1.00
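
To make the arithmetic visible, the same rescaling can be sketched with plain pandas, using the min-max formula (X - min) / (max - min) applied column by column. The name `manual` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Feature1': [10, 20, 30, 40, 50],
                   'Feature2': [100, 200, 300, 400, 500]})

# Subtract each column's minimum, then divide by that column's range
manual = (df - df.min()) / (df.max() - df.min())
print(manual)
```

Both features end up with identical values even though their original magnitudes differ by a factor of ten, which is the point of bringing them to a common scale.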


**2. Z-Score Normalization.**


- Z-score normalization rescales each feature so that it has a mean of μ=0 and a standard deviation of σ=1, the same parameters as the standard normal distribution:



z = (X - μ) / σ

where:

- X is a single raw data value
- μ is the population mean
- σ is the population standard deviation

In [9]:
import pandas as pd
from sklearn.preprocessing import StandardScaler


data = {
    'Feature1': [10, 20, 30, 40, 50],
    'Feature2': [100, 200, 300, 400, 500]
}
df = pd.DataFrame(data)



standard_scaler = StandardScaler()

df[['Feature1', 'Feature2']] = standard_scaler.fit_transform(df[['Feature1', 'Feature2']])
df[['Feature1', 'Feature2']]

   Feature1  Feature2
0 -1.414214 -1.414214
1 -0.707107 -0.707107
2  0.000000  0.000000
3  0.707107  0.707107
4  1.414214  1.414214
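
The formula can also be applied by hand with pandas, as sketched below. One detail matters: `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas' `std()` defaults to the sample standard deviation (`ddof=1`), so the two give different numbers unless `ddof` is set. The names `z_population` and `z_sample` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Feature1': [10, 20, 30, 40, 50],
                   'Feature2': [100, 200, 300, 400, 500]})

# Population standard deviation (ddof=0) reproduces the scaler output above
z_population = (df - df.mean()) / df.std(ddof=0)
print(z_population)

# pandas' default sample standard deviation (ddof=1) gives slightly smaller magnitudes
z_sample = (df - df.mean()) / df.std()
print(z_sample)
```

For Feature1 the mean is 30 and the population standard deviation is about 14.142, so the first value becomes (10 - 30) / 14.142, which is about -1.414.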
