
**QUESTION**  
You are tasked with analyzing and pre-processing a dataset (`Business Funding Data.csv`) containing business funding information in Nigeria. Your goal is to prepare the data for further analysis by applying various preprocessing techniques. This assignment will allow you to explore the data, identify key issues, and apply creative solutions to clean and transform it effectively.

**Deliverables:**
- A notebook of a well-cleaned Business Funding Data ready for analysis.
- Using a Text cell or Markdown in your notebook, answer the following questions.
  - Your observations from exploring the data.
  - The steps you took to clean, preprocess, and transform the data.
  - Justifications for each technique or decision you applied.
  - Reflections on the importance of preprocessing in real-world data analysis.


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# Load the dataset, specifying the encoding as 'latin-1'
df = pd.read_csv("Business Funding Data.csv", encoding="latin-1")

# Display basic information and first few rows
print("Dataset Info:")
df.info()
print("\nFirst 5 rows:")
print(df.head())

# Checking for missing values
print("\nMissing Values:")
print(df.isnull().sum())

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Website Domain             26 non-null     object 
 1   Effective date             6 non-null      object 
 2   Found At                   26 non-null     object 
 3   Financing Type             8 non-null      object 
 4   Financing Type Normalized  8 non-null      object 
 5   Categories                 26 non-null     object 
 6   Investors                  13 non-null     object 
 7   Investors Count            13 non-null     float64
 8   Amount                     26 non-null     object 
 9   Amount Normalized          26 non-null     int64  
 10  Source Urls                26 non-null     object 
dtypes: float64(1), int64(1), object(9)
memory usage: 2.4+ KB

First 5 rows:
  Website Domain Effective date                   Found At Financing Type  \
0
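The non-null counts printed by `df.info()` suggest that several columns are mostly empty (for example, `Effective date` has 6 non-null values out of 26). A minimal sketch of how the share of missing values per column could be summarized before deciding whether to impute or drop a column. The toy frame and the helper name `missing_share` are illustrative assumptions, not part of the assignment data.

```python
import numpy as np
import pandas as pd


def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return frame.isnull().mean().sort_values(ascending=False)


# Toy data standing in for the real CSV (values are made up for illustration).
toy = pd.DataFrame({
    "Effective date": ["2021-03-01", None, None, None],
    "Investors Count": [2.0, np.nan, 1.0, np.nan],
    "Amount Normalized": [100, 200, 300, 400],
})

share = missing_share(toy)
print(share)
```

A column whose share is very high may be a poor candidate for mode or median imputation, because most of its values would then be invented rather than observed.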

In [3]:
# Handling missing values
for col in df.columns:
    if df[col].dtype == 'object':
        # Fill categorical missing values with the mode (most frequent value)
        df[col] = df[col].fillna(df[col].mode()[0])
    else:
        # Fill numerical missing values with the median
        df[col] = df[col].fillna(df[col].median())

# Removing duplicates
df.drop_duplicates(inplace=True)

# Converting date columns to datetime format
# Note: no column is literally named 'date' (the date column is 'Effective date'),
# so this check does not match and no conversion is performed.
if 'date' in df.columns:
    df['date'] = pd.to_datetime(df['date'], errors='coerce')
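The date check in the cell above looks for a column named `date`, while `df.info()` lists the date column as `Effective date`. A minimal sketch of how that column could be parsed, shown on a toy frame because the actual date format in the CSV is not visible here; the sample strings are assumptions.

```python
import pandas as pd

# Toy values standing in for the real column (the true format is an assumption).
toy = pd.DataFrame({"Effective date": ["2021-03-01", "not a date", None]})

# errors="coerce" turns unparseable entries into NaT instead of raising.
toy["Effective date"] = pd.to_datetime(toy["Effective date"], errors="coerce")

n_missing = int(toy["Effective date"].isna().sum())
print(toy.dtypes)
print("Missing dates:", n_missing)
```

Parsing before any imputation keeps missing dates as `NaT`, which makes it explicit how many dates were never recorded.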


In [4]:
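# Encoding categorical variables
# Note: LabelEncoder assigns integer codes in sorted order of the labels,
# so the codes carry no meaningful ranking between categories.
label_enc = LabelEncoder()
categorical_cols = df.select_dtypes(include=['object']).columns

for col in categorical_cols:
    df[col] = label_enc.fit_transform(df[col])

# Handling outliers using IQR
# After encoding, every column is numeric, so this loop caps the encoded
# categorical codes as well as the originally numeric columns.
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns
for col in numerical_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Cap values outside [lower_bound, upper_bound] at the nearest bound
    df[col] = np.where(df[col] < lower_bound, lower_bound, df[col])
    df[col] = np.where(df[col] > upper_bound, upper_bound, df[col])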
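The cell above label-encodes every object column and then applies IQR capping to all `int64`/`float64` columns, which at that point includes the encoded codes. One possible alternative, sketched here on made-up data, is to record the genuinely numeric columns first, cap only those, and one-hot encode nominal columns with `pd.get_dummies`. The category labels and amounts are illustrative assumptions.

```python
import pandas as pd

# Toy data; labels and amounts are invented for illustration.
toy = pd.DataFrame({
    "Financing Type": ["Seed", "Series A", "Seed", "Grant"],
    "Amount Normalized": [100, 120, 110, 10_000],
})

# Record the genuinely numeric columns BEFORE encoding anything.
numeric_cols = toy.select_dtypes(include="number").columns

for col in numeric_cols:
    q1, q3 = toy[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    # Cap values outside the 1.5 * IQR fences at the nearest fence.
    toy[col] = toy[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# One indicator column per category, so no artificial ordering is implied.
encoded = pd.get_dummies(toy, columns=["Financing Type"])
print(encoded)
```

With this ordering the outlier treatment never touches category codes, and the extreme amount is capped at the upper fence rather than removed.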


In [5]:
# Standardizing numerical columns
scaler = StandardScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])

# Display the cleaned dataset
print("\nCleaned Dataset:")
print(df.head())

# Save the cleaned data
df.to_csv("Cleaned_Business_Funding_Data.csv", index=False)


Cleaned Dataset:
   Website Domain  Effective date  Found At  Financing Type  \
0        1.283565             0.0 -1.730152             0.0   
1        1.438787             0.0  0.124856             0.0   
2        1.438787             0.0  1.516112             0.0   
3       -1.199984             0.0 -0.338896             0.0   
4       -0.268653             0.0 -1.266400             0.0   

   Financing Type Normalized  Categories  Investors  Investors Count  \
0                        0.0    1.023182      -0.75              0.0   
1                        0.0    1.023182      -0.25              0.0   
2                        0.0   -1.105036      -0.75              0.0   
3                        0.0    1.023182       1.75              0.0   
4                        0.0    1.023182       0.00              0.0   

     Amount  Amount Normalized  Source Urls  
0 -1.400000           1.921862     1.666667  
1  0.200000           1.921862    -1.000000  
2  1.400000           0.493204  
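Several columns in the cleaned output (`Effective date`, `Financing Type`, `Financing Type Normalized`, `Investors Count`) show 0.0 in every displayed row. One plausible explanation, offered here as a hypothesis rather than a verified diagnosis, is that mode imputation made one value dominate the column, so Q1 equals Q3, the IQR is zero, and capping collapses every value to a single constant that `StandardScaler` then maps to 0. The toy values are assumptions chosen to reproduce that effect.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# 26 values where 20 share one code, mimicking a column dominated by its imputed mode.
values = np.array([3.0] * 20 + [0.0, 1.0, 2.0, 4.0, 5.0, 6.0])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1  # zero here because both quartiles fall on the dominant value
capped = np.clip(values, q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# A constant column has zero variance; StandardScaler then returns all zeros.
scaled = StandardScaler().fit_transform(capped.reshape(-1, 1)).ravel()
print(np.unique(capped), np.unique(scaled))
```

If this is what happened in the real data, those columns carry no information after preprocessing, which is worth noting in the observations section of the write-up.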