# Data Normalization

### Step 1: Install and Import Libraries

In [None]:
!pip install pandas
!pip install matplotlib
!pip install scikit-learn 

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

### Step 2: Load the Dataset into a DataFrame

In [None]:
file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"

df = pd.read_csv(file_path)

# Display the first few rows to check if data is loaded correctly
print(df.head())


### Section 1: Handling Duplicates
#### Task 1: Identify and remove duplicate rows.

In [None]:
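## Write your code here
# Rows flagged by duplicated() repeat an earlier row in every column
dup = df[df.duplicated()]
print("Number of duplicate rows:", len(dup))
print(dup)

# Remove the duplicates, keeping the first occurrence of each row
df = df.drop_duplicates()
print("Rows remaining after removing duplicates:", len(df))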

### Section 2: Handling Missing Values
#### Task 2: Identify missing values in CodingActivities

In [None]:
## Write your code here
# Count the missing (NaN) entries in CodingActivities
cod_mis = df['CodingActivities'].isnull().sum()
print("Missing values in CodingActivities:")
print(cod_mis)

#### Task 3: Impute missing values in CodingActivities with forward-fill.

In [None]:
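## Write your code here
# Forward-fill: each missing entry takes the last non-missing value above it.
# Assign the result back; fillna(method='ffill', inplace=True) on a column is deprecated.
df['CodingActivities'] = df['CodingActivities'].ffill()
print("Missing values remaining in CodingActivities:")
print(df['CodingActivities'].isnull().sum())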
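To make the behaviour of forward-fill concrete, here is a minimal sketch on a toy Series. The values are made up for illustration and are not taken from the survey data. It shows one caveat: a gap at the very top of the column has nothing above it to copy, so it stays missing.

```python
import pandas as pd

# Toy column with gaps (made-up values, not from the survey dataset)
s = pd.Series([None, "Hobby", None, None, "Work", None], dtype="object")

# Copy the last seen non-missing value forward into each gap
filled = s.ffill()

# The leading gap has no earlier value to copy, so it remains missing
remaining = int(filled.isna().sum())

print(filled.tolist())
print("still missing:", remaining)
```

If the count printed in Task 3 is not zero, a leading gap like this is one possible cause.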

##### Note: Before normalizing ConvertedCompYearly, ensure that any missing values (NaN) in this column are handled appropriately. You can choose to either drop the rows containing NaN or replace the missing values with a suitable statistic (e.g., median or mean).
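The two options in the note can be sketched on a toy DataFrame as follows. The numbers are invented for illustration; the single very large value is an assumption meant to mimic the skew often seen in compensation data, where the mean is pulled upward and the median is not.

```python
import pandas as pd

# Toy data (made-up values); 1_000_000 stands in for a high-earner outlier
toy = pd.DataFrame({"ConvertedCompYearly": [40_000, 50_000, None, 60_000, 1_000_000]})

# Option 1: drop the rows where compensation is missing
dropped = toy.dropna(subset=["ConvertedCompYearly"])

# Option 2: replace missing values with a statistic of the column
col = toy["ConvertedCompYearly"]
median_filled = col.fillna(col.median())
mean_filled = col.fillna(col.mean())

print("rows after dropping:", len(dropped))
print("median fill:", median_filled[2], "| mean fill:", mean_filled[2])
```

Which option suits the real dataset depends on how many values are missing and how skewed the column is, so treat this as a comparison aid rather than a recommendation.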

### Section 3: Normalizing Compensation Data
#### Task 4: Identify compensation-related columns, such as ConvertedCompYearly.
##### Normalization is commonly applied to compensation data to bring values within a comparable range. Here, you’ll identify ConvertedCompYearly or similar columns, which contain compensation information. This column will be used in the subsequent tasks for normalization.

In [None]:
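## Write your code here
print(df.columns)

# Find columns whose names suggest compensation data
keywords = ('comp', 'salary', 'pay')
comp_col = [col for col in df.columns if any(k in col.lower() for k in keywords)]
print("Compensation-related columns:")
print(comp_col)

# Count missing values in ConvertedCompYearly
comp_mis = df['ConvertedCompYearly'].isnull().sum()
print("Missing values in ConvertedCompYearly:")
print(comp_mis)

# Impute missing values with the column mean.
# Assign the result back instead of calling fillna(..., inplace=True) on the column.
mean_val = df['ConvertedCompYearly'].mean()
df['ConvertedCompYearly'] = df['ConvertedCompYearly'].fillna(mean_val)
print("Missing values after imputation:")
print(df['ConvertedCompYearly'].isnull().sum())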

#### Task 5: Normalize ConvertedCompYearly using Min-Max Scaling.
##### Min-Max Scaling brings all values in a column to a 0-1 range, making it useful for comparing data across different scales. Here, you will apply Min-Max normalization to the ConvertedCompYearly column, creating a new column ConvertedCompYearly_MinMax with normalized values.
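The arithmetic behind Min-Max Scaling is `(x - min) / (max - min)`. The sketch below applies it by hand to a few made-up salaries using only the standard library. The helper name `min_max_scale` is hypothetical, and returning 0.0 for a constant column is a choice made for this sketch to avoid dividing by zero.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: the range is zero, so map everything to 0.0 (sketch choice)
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up salaries for illustration
salaries = [30_000, 45_000, 60_000, 120_000]
scaled = min_max_scale(salaries)

for raw, s in zip(salaries, scaled):
    print(f"{raw:>7} -> {s:.3f}")
```

Because the minimum and maximum define the scale, a single extreme salary compresses all other values toward 0.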



In [None]:
## Write your code here
# MinMaxScaler expects 2-D input, hence the double brackets; ravel() flattens the result
scaler = MinMaxScaler()
df['ConvertedCompYearly_MinMax'] = scaler.fit_transform(df[['ConvertedCompYearly']]).ravel()
print(df[['ConvertedCompYearly', 'ConvertedCompYearly_MinMax']])

#### Task 6: Apply Z-score Normalization to ConvertedCompYearly.
##### Z-score normalization standardizes values by converting them to a distribution with a mean of 0 and a standard deviation of 1. This method is helpful for datasets with a Gaussian (normal) distribution. Here, you’ll calculate Z-scores for the ConvertedCompYearly column, saving the results in a new column ConvertedCompYearly_Zscore.
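A Z-score is `(x - mean) / std`. One detail worth checking is which standard deviation is used: pandas `Series.std()` defaults to the sample standard deviation (`ddof=1`), while scikit-learn's `StandardScaler` uses the population standard deviation (`ddof=0`). The sketch below shows both on made-up salaries with the standard library; the helper name `z_scores` and its `sample` flag are hypothetical.

```python
import statistics


def z_scores(values, sample=True):
    """Return (v - mean) / std for each value.

    sample=True uses the sample standard deviation (divides by n - 1);
    sample=False uses the population standard deviation (divides by n).
    """
    mu = statistics.fmean(values)
    sigma = statistics.stdev(values) if sample else statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


# Made-up salaries for illustration
salaries = [30_000, 45_000, 60_000, 120_000]
z_sample = z_scores(salaries)
z_pop = z_scores(salaries, sample=False)

print([round(z, 3) for z in z_sample])
print([round(z, 3) for z in z_pop])
```

Both versions have mean 0; they differ only by a constant factor, which becomes negligible for a dataset with many rows. Z-scoring rescales the values but does not change the shape of the distribution, so skewed data stays skewed.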



In [None]:
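## Write your code here
# Z-score: subtract the mean, then divide by the standard deviation
# (pandas .std() returns the sample standard deviation, ddof=1, by default)
mean = df['ConvertedCompYearly'].mean()
std = df['ConvertedCompYearly'].std()
df['ConvertedCompYearly_Zscore'] = (df['ConvertedCompYearly'] - mean) / std
print(df['ConvertedCompYearly_Zscore'])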

### Section 4: Visualization of Normalized Data
#### Task 7: Visualize the distribution of ConvertedCompYearly, ConvertedCompYearly_MinMax, and ConvertedCompYearly_Zscore
##### Visualization helps you understand how normalization changes the data distribution. In this task, create histograms for the original ConvertedCompYearly, as well as its normalized versions (ConvertedCompYearly_MinMax and ConvertedCompYearly_Zscore). This will help you compare how each normalization technique affects the data range and distribution.
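One possible way to draw the three histograms side by side is sketched below. The helper `plot_distributions` is a hypothetical name, and the toy DataFrame uses made-up values so that the block runs on its own; in the notebook you would pass `df` after Tasks 5 and 6 have created the new columns.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_distributions(data, columns, bins=30):
    """Draw one histogram per column, side by side, and return the figure."""
    fig, axes = plt.subplots(1, len(columns), figsize=(5 * len(columns), 4), squeeze=False)
    for ax, col in zip(axes[0], columns):
        # Drop NaN so missing values do not reach the histogram
        ax.hist(data[col].dropna(), bins=bins, edgecolor="black")
        ax.set_title(col)
        ax.set_xlabel("Value")
        ax.set_ylabel("Frequency")
    fig.tight_layout()
    return fig


# Toy stand-in for df (made-up values)
toy = pd.DataFrame({"ConvertedCompYearly": [30_000, 45_000, 60_000, 120_000, None]})
col = toy["ConvertedCompYearly"]
toy["ConvertedCompYearly_MinMax"] = (col - col.min()) / (col.max() - col.min())
toy["ConvertedCompYearly_Zscore"] = (col - col.mean()) / col.std()

columns = ["ConvertedCompYearly", "ConvertedCompYearly_MinMax", "ConvertedCompYearly_Zscore"]
fig = plot_distributions(toy, columns, bins=5)
# plt.show()  # uncomment when running outside a notebook
```

Both normalizations are linear rescalings, so the three histograms should have the same shape and differ only in the range shown on the x-axis.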