# **Data Cleaning**

## **Loading Libraries**

In [1]:
import pandas as pd
import numpy as np

## **Loading Dataset**

In [2]:
wpf = pd.read_csv("/workspaces/Global-Population-Growth-EDA-and-Prediction/data/raw/world_population_growth.csv")

## **Handling Missing Data**

### **Identifying Missing Data**

In [4]:
wpf.isnull().sum()

City                  0
Country               0
Continent            11
Population (2024)     0
Population (2023)     0
Growth Rate           0
dtype: int64

### **Handling Missing Values**

In [5]:
wpf['Continent'] = wpf['Continent'].fillna('Unknown')


In [6]:
wpf.isnull().sum()

City                 0
Country              0
Continent            0
Population (2024)    0
Population (2023)    0
Growth Rate          0
dtype: int64
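Filling every gap with `'Unknown'` removes the nulls but discards information that may be recoverable: Ouagadougou, for instance, ends up with `Unknown` as its continent in the outlier listing further down. A minimal sketch of one alternative, assuming other rows for the same country carry a known continent, is to build a lookup from those rows and fall back to `'Unknown'` only when no match exists. The `demo` frame and its values are illustrative stand-ins, not rows from the dataset.

```python
import pandas as pd

# Toy stand-in for the real dataset; values are illustrative only.
demo = pd.DataFrame({
    "City": ["Lagos", "Abuja", "Ouagadougou", "Bobo-Dioulasso", "Atlantis"],
    "Country": ["Nigeria", "Nigeria", "Burkina Faso", "Burkina Faso", "Nowhere"],
    "Continent": ["Africa", None, None, "Africa", None],
})

# Build a Country -> Continent lookup from the rows where Continent is known.
known = demo.dropna(subset=["Continent"]).drop_duplicates("Country")
country_to_continent = known.set_index("Country")["Continent"]

# Fill gaps from the lookup first, then fall back to 'Unknown'.
demo["Continent"] = (
    demo["Continent"]
    .fillna(demo["Country"].map(country_to_continent))
    .fillna("Unknown")
)
print(demo)
```

Only the row whose country never appears with a known continent keeps the `'Unknown'` placeholder.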

## **Outlier Detection and Handling**

### **Identifying Outliers**

In [7]:
Q1 = wpf['Growth Rate'].quantile(0.25)
Q3 = wpf['Growth Rate'].quantile(0.75)
IQR = Q3 - Q1

outliers = wpf[(wpf['Growth Rate'] < (Q1 - 1.5 * IQR)) | (wpf['Growth Rate'] > (Q3 + 1.5 * IQR))]
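The same 1.5 × IQR rule can be wrapped in a small helper so the bounds are computed once and reused. This is a sketch only: `iqr_bounds` and the `rates` values are hypothetical and not part of the notebook.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple:
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Illustrative values only, not taken from the dataset.
rates = pd.Series([0.01, 0.02, 0.02, 0.03, 0.03, 0.04, 0.20])
lower, upper = iqr_bounds(rates)

# A value is flagged when it falls outside either fence.
is_outlier = (rates < lower) | (rates > upper)
print(rates[is_outlier])
```

Keeping the boolean mask in a variable also makes it easy to count the flagged rows with `is_outlier.sum()` before deciding how to handle them.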

In [9]:
print(outliers)

               City        Country      Continent  Population (2024)  \
47    Dar Es Salaam       Tanzania         Africa            8161231   
48         New York  United States  North America            7931147   
113         Kampala         Uganda         Africa            4050826   
116           Abuja        Nigeria         Africa            4025735   
129     Los Angeles  United States  North America            3748640   
146     Ouagadougou   Burkina Faso        Unknown            3358934   
199         Chicago  United States  North America            2590002   
226          Aleppo          Syria           Asia            2317650   
355    Philadelphia  United States  North America            1533916   
395             Uyo        Nigeria         Africa            1393453   
403          Mwanza       Tanzania         Africa            1378014   
429   Abomey Calavi          Benin         Africa            1314916   
436           Nnewi        Nigeria         Africa            130

### **Handling Outliers**

In [10]:
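# Keep only rows inside the IQR fences; .copy() makes the result an independent
# DataFrame so the column assignments below do not raise SettingWithCopyWarning.
wpf = wpf[~((wpf['Growth Rate'] < (Q1 - 1.5 * IQR)) | (wpf['Growth Rate'] > (Q3 + 1.5 * IQR)))].copy()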
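Dropping rows removes whole cities from the analysis, including large ones such as New York and Los Angeles in the listing above. If those rows need to stay, one common alternative (not what this notebook does) is to clip the values to the IQR fences with `Series.clip`. The sketch below uses made-up values.

```python
import pandas as pd

# Illustrative values only, not taken from the dataset.
rates = pd.Series([0.01, 0.02, 0.02, 0.03, 0.03, 0.04, 0.20])

q1, q3 = rates.quantile(0.25), rates.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values beyond a fence are replaced by the fence; no rows are removed.
clipped = rates.clip(lower=lower, upper=upper)
print(clipped)
```

Clipping preserves the row count but distorts the extreme values, so which approach fits depends on whether the outliers are errors or genuine observations.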

## **Feature Engineering**

### **Population Change**

In [11]:
wpf['Population Change'] = wpf['Population (2024)'] - wpf['Population (2023)']
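The new column also gives a quick consistency check on `Growth Rate`, provided the check runs before scaling. The sketch below assumes `Growth Rate` is stored as the fractional year-over-year change, `(2024 - 2023) / 2023`. The notebook does not state the definition, so treat that as an assumption; the values are illustrative.

```python
import pandas as pd

# Toy stand-in for the real dataset; values are illustrative only.
demo = pd.DataFrame({
    "Population (2024)": [1_050_000, 2_020_000, 990_000],
    "Population (2023)": [1_000_000, 2_000_000, 1_000_000],
    "Growth Rate": [0.05, 0.01, -0.01],
})

demo["Population Change"] = demo["Population (2024)"] - demo["Population (2023)"]

# ASSUMPTION: Growth Rate is the fractional change relative to 2023.
implied_rate = demo["Population Change"] / demo["Population (2023)"]

# Flag rows where the stored rate disagrees with the implied one.
mismatch = (implied_rate - demo["Growth Rate"]).abs() > 1e-3
print(demo[mismatch])
```

An empty result means the stored rates agree with the population columns under that assumed definition.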

### **Scaling and Normalization**

In [12]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scale_cols = ['Population (2024)', 'Population (2023)', 'Growth Rate']
wpf[scale_cols] = scaler.fit_transform(wpf[scale_cols])
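With its default `feature_range=(0, 1)`, `MinMaxScaler` maps each column independently through `(x - min) / (max - min)`. The sketch below reproduces that formula in plain pandas on made-up numbers to show what the transformed columns look like; it is an illustration of the arithmetic, not a replacement for the scaler.

```python
import pandas as pd

# Illustrative values only, not taken from the dataset.
demo = pd.DataFrame({
    "Population (2024)": [100, 200, 400],
    "Growth Rate": [-0.01, 0.00, 0.03],
})

# Column-wise min-max scaling: the smallest value maps to 0, the largest to 1.
scaled = (demo - demo.min()) / (demo.max() - demo.min())
print(scaled)
```

Note that the notebook scales the two population columns and `Growth Rate` but leaves `Population Change` in raw units, so the columns are no longer directly comparable after this step.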

## **Saving the Processed Data**

### **Saving Cleaned Data**

In [13]:
wpf.to_csv("/workspaces/Global-Population-Growth-EDA-and-Prediction/data/processed/cleaned_population_growth.csv", index=False)
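`to_csv` does not create missing parent directories, so the call fails if `data/processed/` does not exist yet. A minimal sketch of a safer save, using a temporary directory and a hypothetical file name in place of the project path, creates the folder first and reads the file back as a sanity check.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the cleaned dataset.
demo = pd.DataFrame({"City": ["A", "B"], "Growth Rate": [0.1, 0.2]})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical output location; the real notebook uses its own path.
    out_path = Path(tmp) / "processed" / "cleaned_demo.csv"

    # Create the target folder if it is missing, then write without the index.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    demo.to_csv(out_path, index=False)

    # Read the file back to confirm the round trip.
    reloaded = pd.read_csv(out_path)

same_shape = reloaded.shape == demo.shape
same_columns = list(reloaded.columns) == list(demo.columns)
print(same_shape, same_columns)
```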