# Exploratory Data Analysis: Clean & Prepare Dataset
This notebook performs basic data preprocessing, including column cleanup and label encoding, and saves a cleaned version of the dataset for later use.

### Import & Initial Cleanup
This cell performs the following:

- Loads the raw dataset from FINAL.csv.
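- Drops two columns that are not used in later steps: ' coastline_wri' and ' Elevation (ft)'. The raw names carry a leading space, and the elevation in feet duplicates ' Elevation (m)'.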
- Prints the column names to detect formatting issues like leading/trailing spaces.

In [None]:
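import pandas as pd

# Load the raw data; the header names in this file are padded with spaces.
df = pd.read_csv("FINAL.csv")
# Drop the unused columns (names must match the padded raw headers exactly).
df = df.drop(columns=[' coastline_wri', ' Elevation (ft)'])
# Print the names as a list so stray whitespace shows up inside the quotes.
print(df.columns.tolist())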


['Country                 ', ' coastline_wf', ' sub-region                     ', ' latitude  ', ' longitude  ', ' Quality of Life Value', ' Quality of Life Category', ' Elevation (m)', ' Temperature (°C)']
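Padding in a printed list is easy to miss by eye. As a minimal sketch, assuming a small stand-in DataFrame rather than FINAL.csv (the names `demo` and `padded` are illustrative), padded headers can be flagged programmatically:

```python
import pandas as pd

# Stand-in frame; in the notebook this would be the df loaded from FINAL.csv.
demo = pd.DataFrame(columns=["Country   ", " latitude  ", "Elevation_m"])

# A name is "padded" when stripping whitespace changes it.
padded = [name for name in demo.columns if name != name.strip()]

# repr() keeps the quotes, so the extra spaces stay visible.
print([repr(name) for name in padded])
```

An empty `padded` list would mean the headers need no trimming.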


### Rename Columns for Usability

Cleans up column names by:

- Removing leading/trailing spaces.
- Replacing inner spaces with underscores and folding units into the name (e.g., ' Elevation (m)' becomes 'Elevation_m').
- Capitalizing the first letter of each name so the columns follow one consistent style.

In [35]:
# One mapping from raw header to cleaned name, applied in a single call.
df = df.rename(columns={
    'Country                 ': 'Country',
    ' coastline_wf': 'Coastline_wf',
    ' sub-region                     ': 'Sub-region',
    ' latitude  ': 'Latitude',
    ' longitude  ': 'Longitude',
    ' Quality of Life Value': 'Quality_of_Life_Value',
    ' Quality of Life Category': 'Quality_of_Life_Category',
    ' Elevation (m)': 'Elevation_m',
    ' Temperature (°C)': 'Temperature_C',
})
print(df.columns.tolist())

['Country', 'Coastline_wf', 'Sub-region', 'Latitude', 'Longitude', 'Quality_of_Life_Value', 'Quality_of_Life_Category', 'Elevation_m', 'Temperature_C']
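An explicit mapping is easy to audit, but for wider tables a rule-based cleanup may be less tedious. The sketch below is an alternative under stated assumptions, not the method this notebook uses: `normalize_column` is a hypothetical helper, and it does not capitalize names, so ' sub-region' would become 'sub-region' rather than 'Sub-region'.

```python
import re
import pandas as pd

def normalize_column(name: str) -> str:
    """Hypothetical helper: trim padding and build an underscore-style name."""
    name = name.strip()                  # remove leading/trailing spaces
    name = name.replace("°", "")         # 'Temperature (°C)' -> 'Temperature (C)'
    name = re.sub(r"[()]", "", name)     # drop the parentheses around units
    name = re.sub(r"\s+", "_", name)     # remaining spaces -> underscores
    return name

# Stand-in headers copied from the raw column list printed earlier.
raw = ["Country                 ", " sub-region                     ",
       " Quality of Life Value", " Elevation (m)", " Temperature (°C)"]
demo = pd.DataFrame(columns=raw)

# rename() accepts a function and applies it to every column name.
demo = demo.rename(columns=normalize_column)
print(demo.columns.tolist())
```

The trade-off is that a rule handles new columns automatically, while the explicit dictionary keeps full control over each final name.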


### Select and Reorder Relevant Columns
This step lists the columns explicitly in the desired order: identifiers first, then the geographic and climate features, then the quality-of-life columns. All nine columns are kept; only their order changes.

In [36]:
# .copy() makes the result an independent DataFrame before later in-place edits.
df = df[['Country', 'Sub-region', 'Coastline_wf', 'Latitude', 'Longitude',
         'Elevation_m', 'Temperature_C', 'Quality_of_Life_Value',
         'Quality_of_Life_Category']].copy()
print(df.head())

                    Country                        Sub-region  Coastline_wf  \
0  Albania                    Southern Europe                         362.0   
1  Algeria                    Northern Africa                         998.0   
2  Argentina                  Latin America and the Caribbean        4989.0   
3  Armenia                    Western Asia                              0.0   
4  Australia                  Australia and New Zealand             25760.0   

    Latitude   Longitude  Elevation_m  Temperature_C  Quality_of_Life_Value  \
0  41.153332   20.168331        708.0          12.44                 104.16   
1  28.033886    1.659626        800.0          23.60                  98.83   
2 -38.416097  -63.616672        595.0          16.30                 115.06   
3  40.069099   45.038189       1792.0           7.82                 116.56   
4 -25.274398  133.775136        330.0          22.05                 190.69   

    Quality_of_Life_Category  
0   'Low'          
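Because the column list is typed by hand, a misspelled name only surfaces as a `KeyError` at selection time. A minimal sketch, assuming a tiny stand-in frame (`demo`, `wanted`, and `missing` are illustrative names), checks the list before selecting:

```python
import pandas as pd

# Stand-in frame with the columns in their pre-reorder positions.
demo = pd.DataFrame({
    "Country": ["Albania"],
    "Coastline_wf": [362.0],
    "Sub-region": ["Southern Europe"],
})

wanted = ["Country", "Sub-region", "Coastline_wf"]

# Report any requested column that the frame does not have.
missing = [col for col in wanted if col not in demo.columns]
if missing:
    raise KeyError(f"Columns not found: {missing}")

# Selecting with a list returns the columns in the listed order.
demo = demo[wanted].copy()
print(demo.columns.tolist())
```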

### Label Encoding of Categorical Feature

This cell does the following:

- Prints all unique values of the Sub-region column.
- Encodes Sub-region into integer codes using LabelEncoder. The codes follow the sorted order of the labels, so they are identifiers rather than a ranking.
- Inserts the new encoded column Sub_region_encoded immediately after the original Sub-region column.
- Saves the cleaned DataFrame to a new CSV: FINAL_cleaned.csv.

In [38]:
from sklearn.preprocessing import LabelEncoder

# List the distinct sub-regions before encoding them
for value in df["Sub-region"].dropna().unique():
    print("-", value)

# Encode the 'Sub-region' column
label_encoder = LabelEncoder()
encoded_col = label_encoder.fit_transform(df['Sub-region'])

# Create a new column name
new_col_name = 'Sub_region_encoded'

# Find the index of 'Sub-region' column
insert_pos = df.columns.get_loc('Sub-region') + 1

# Insert the new encoded column right after 'Sub-region'
df.insert(insert_pos, new_col_name, encoded_col)

# Show the updated column order, then save the cleaned data without the index
print(df.head())
df.to_csv("FINAL_cleaned.csv", index=False)

-  Southern Europe                
-  Northern Africa                
-  Latin America and the Caribbean
-  Western Asia                   
-  Australia and New Zealand      
-  Western Europe                 
-  Southern Asia                  
-  Eastern Europe                 
-  Sub-Saharan Africa             
-  South-eastern Asia             
-  Northern America               
-  Eastern Asia                   
-  Northern Europe                
-  Central Asia                   
                    Country                        Sub-region  \
0  Albania                    Southern Europe                   
1  Algeria                    Northern Africa                   
2  Argentina                  Latin America and the Caribbean   
3  Armenia                    Western Asia                      
4  Australia                  Australia and New Zealand         

   Sub_region_encoded  Coastline_wf   Latitude   Longitude  Elevation_m  \
0                  10         362.0  41.1533
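The printed sub-regions still carry padding around the values, and the integer codes are hard to read without the label they stand for. The sketch below, which assumes a small stand-in frame rather than the real data (`demo`, `labels`, and `mapping` are illustrative names), trims the values before encoding and recovers the label-to-code mapping from the fitted encoder:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in values with the same kind of padding seen in the printed sub-regions.
demo = pd.DataFrame({"Sub-region": [" Southern Europe   ", " Northern Africa ",
                                    " Western Asia", " Southern Europe   "]})

# Trim padding so equal labels compare equal regardless of spacing.
labels = demo["Sub-region"].str.strip()

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ holds the sorted labels; a label's position is its integer code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns the codes back into the original labels.
print(encoder.inverse_transform(codes).tolist())
```

Keeping the fitted encoder (or the `mapping` dictionary) around makes it possible to decode `Sub_region_encoded` later.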

### Summary

This notebook:

- Loads and cleans a raw dataset by fixing column names and dropping irrelevant fields.
- Encodes a key categorical variable (Sub-region) as integer codes so it can be passed to models that require numeric input.
- Saves the cleaned DataFrame to FINAL_cleaned.csv, which is ready for further exploratory data analysis and supervised learning.
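
Before building on the saved file, it can help to confirm that it reloads with the expected columns. A minimal sketch, assuming stand-in rows and an in-memory buffer in place of FINAL_cleaned.csv (the values and the names `demo` and `reloaded` are illustrative):

```python
import io
import pandas as pd

# Stand-in rows; the real check would use the cleaned df from this notebook.
demo = pd.DataFrame({
    "Country": ["Albania", "Algeria"],
    "Sub_region_encoded": [10, 5],
    "Elevation_m": [708.0, 800.0],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Read the CSV text back and compare the structure with what was written.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.columns.tolist())
print(reloaded.shape)
```

Writing with `index=False` matters here: without it, the row index is saved as an extra unnamed column that reappears on reload.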