# Life Expectancy Data Science Project

---------------------------------------------

### Introduction
This analysis explores a life expectancy dataset, aiming to uncover factors affecting life expectancy across countries over time. We'll handle missing values, engineer features, perform exploratory analysis, visualize patterns, and build a regression model to predict life expectancy.

### Objectives
- Understand the structure and quality of the dataset
- Identify key features affecting life expectancy
- Handle missing data appropriately
- Engineer new features to improve prediction
- Visualize relationships and trends
- Build a regression model to predict life expectancy
- Evaluate model performance using cross-validation
- Derive actionable insights

### Task 1: Explore Dataset and Missing Values

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score


In [None]:
df = pd.read_csv('Life_Expectancy_Data.csv')
df.shape 

In [None]:
df.dtypes

In [None]:
df.dtypes.value_counts()

In [None]:
df.columns

In [None]:
# Strip leading/trailing whitespace from every column name
df.columns = df.columns.str.strip()

df.columns

In [None]:
df.head(20)

In [None]:
print(df.duplicated())

In [None]:
# Count fully duplicated rows
print(df.duplicated().sum())

In [None]:
# Get the number of unique countries from the 'Country' column
number_of_countries = df['Country'].nunique()

# Print the number of unique countries
print(f"The total number of unique countries in the dataset is: {number_of_countries}")

### Task 2: Handle Missing Data and Justify Method

In [None]:
null_values = df.isnull().sum()

In [None]:
#Checks if any column has NaN
df.isnull().any()

In [None]:
#Checks if any row has NaN
df.isnull().any(axis=1)

In [None]:
#Checks if all values in a column are NaN
df.isnull().all()

In [None]:
#Checks if all values in a row are NaN
df.isnull().all(axis=1)

In [None]:
null_percentage = (df.isnull().sum() / len(df))*100
print(null_percentage)

In [None]:
missing_df = pd.DataFrame({'Missing Values': null_values, 'Percent Missing': null_percentage})
missing_df[missing_df['Missing Values'] > 0]

### Task 3: Apply Chosen Method and Evaluate

In [None]:
# Impute numeric columns with their column means
numeric_columns = df.select_dtypes(include='number').columns
df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].mean())
df.isnull().sum()

In [None]:
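# Impute non-numeric columns with their most frequent value (mode).
# fillna returns a new Series, so the result must be assigned back.
non_numeric_columns = df.select_dtypes(include='object').columns
for column in non_numeric_columns:
    df[column] = df[column].fillna(df[column].mode()[0])

df.isnull().sum()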
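
A global mean ignores differences between countries, so one alternative worth comparing is group-wise imputation. The sketch below is an illustration on made-up data, not the method used in this notebook: the helper `impute_by_group` and the toy frame are hypothetical. It fills each gap with the country's own mean and falls back to the overall mean when a country has no observed values.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B', 'C'],
    'GDP': [100.0, np.nan, 300.0, 1000.0, np.nan, np.nan],
})

def impute_by_group(frame, group_col, value_col):
    """Fill NaNs with the group's mean, then with the overall mean as a fallback."""
    group_mean = frame.groupby(group_col)[value_col].transform('mean')
    filled = frame[value_col].fillna(group_mean)
    # Groups with no observed values (country 'C' here) still hold NaN.
    return filled.fillna(frame[value_col].mean())

toy['GDP_filled'] = impute_by_group(toy, 'Country', 'GDP')
print(toy)
```

Whether this beats the global mean depends on how many countries have enough observed values per column, which the missing-value table from Task 2 can help judge.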

### Task 4: Identify Potential Features

In [39]:
# Display all columns in the DataFrame
pd.set_option('display.max_columns', None)
# Summary statistics for all columns (numeric and categorical)
df.describe(include='all')

Statistic,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,BMI,under-five deaths,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling,Health Spending Ratio,Deaths per Infant
count,1649,1649.0,1649,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0
unique,133,,2,,,,,,,,,,,,,,,,,,,,,
top,Afghanistan,,Developing,,,,,,,,,,,,,,,,,,,,,
freq,16,,1407,,,,,,,,,,,,,,,,,,,,,
mean,,2007.840509,,69.302304,168.215282,32.553062,4.533196,698.973558,79.217708,2224.494239,38.128623,44.220133,83.564585,5.955925,84.155246,1.983869,5566.031887,14653630.0,4.850637,4.907762,0.631551,12.119891,0.01922,0.0003643007
std,,4.087711,,8.796834,125.310417,120.84719,4.029189,1759.229336,25.604664,10085.802019,19.754249,162.897999,22.450557,2.299385,21.579193,6.03236,11475.900117,70460390.0,4.599228,4.653757,0.183089,2.795388,0.073488,0.008237884
min,,2000.0,,44.0,1.0,0.0,0.01,0.0,2.0,0.0,2.0,0.0,3.0,0.74,2.0,0.1,1.68135,34.0,0.1,0.1,0.0,4.2,2.8e-05,0.0
25%,,2005.0,,64.4,77.0,1.0,0.81,37.438577,74.0,0.0,19.5,1.0,81.0,4.41,82.0,0.1,462.14965,191897.0,1.6,1.7,0.509,10.3,0.001131,4.39981e-08
50%,,2008.0,,71.7,148.0,3.0,3.79,145.102253,89.0,15.0,43.7,4.0,93.0,5.84,92.0,0.1,1592.572182,1419631.0,3.0,3.2,0.673,12.3,0.003121,1.343009e-06
75%,,2011.0,,75.0,227.0,22.0,7.34,509.389994,96.0,373.0,55.8,29.0,97.0,7.47,97.0,0.7,4718.51291,7658972.0,7.1,7.1,0.751,14.0,0.011613,1.178474e-05


### Task 5: Feature Engineering

In [None]:
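# Health expenditure relative to GDP
df['Health Spending Ratio'] = df['Total expenditure'] / df['GDP']
# Despite the column name, this is infant deaths divided by total population
df['Deaths per Infant'] = df['infant deaths'] / df['Population']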
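
Ratio features like these produce `inf` whenever the denominator is zero, which can silently distort summary statistics and model fitting. A minimal sketch of a guard follows. The helper `safe_ratio`, the column name `infant_deaths_per_capita`, and the toy values are all hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator, denominator):
    """Element-wise division; zero denominators yield NaN instead of inf."""
    return numerator / denominator.replace(0, np.nan)

# Made-up values, including a zero denominator in the last row.
toy = pd.DataFrame({'infant deaths': [10, 0, 5], 'Population': [1000, 500, 0]})
toy['infant_deaths_per_capita'] = safe_ratio(toy['infant deaths'], toy['Population'])
print(toy)
```

The resulting NaN values can then be handled by the same imputation step as other missing data rather than propagating as infinities.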

### Task 6: Impact of New Features

In [None]:
df[['Health Spending Ratio', 'Deaths per Infant']].describe()

### Task 7: Select Key Variables for Visualization

In [None]:
df[['Life expectancy', 'GDP', 'Schooling', 'Alcohol', 'BMI', 'HIV/AIDS']].corr()

### Task 8: Visualizations

In [None]:
plt.figure(figsize=(12,10))
sns.heatmap(df.select_dtypes(include='number').corr(), cmap='coolwarm', annot=True)
plt.title('Correlation Heatmap')
plt.show()

In [None]:
plt.figure(figsize=(8,6))
sns.boxplot(x='Status', y='Life expectancy', data=df)
plt.title('Life Expectancy by Development Status')
plt.show()

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(data=df, x='Alcohol', y='Life expectancy', hue='Status')
plt.title('Life Expectancy vs Alcohol')
plt.show()

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(data=df, x='Hepatitis B', y='Life expectancy', hue='Status')
plt.title('Life Expectancy vs Hepatitis B')
plt.show()

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(data=df, x='HIV/AIDS', y='Life expectancy', hue='Status')
plt.title('Life Expectancy vs HIV/AIDS')
plt.show()

In [None]:
# 3D Plot
fig = px.scatter_3d(df, x='GDP', y='Schooling', z='Life expectancy',
                     color='Status', size='Population')
fig.show()

### Task 9: Interpretation
- Higher GDP and more years of schooling are associated with higher life expectancy.
- Developing countries have a lower average life expectancy and more outliers than developed countries.
- HIV/AIDS has a strong negative correlation with life expectancy.
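
These observations can be checked numerically by ranking features by the strength of their correlation with the target. The sketch below uses a hypothetical helper, `rank_correlations`, on made-up numbers. On the real data it would be called with `df` and `'Life expectancy'`.

```python
import pandas as pd

def rank_correlations(frame, target):
    """Pearson correlation of each numeric column with `target`, strongest first."""
    corr = frame.select_dtypes(include='number').corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Made-up values chosen only to illustrate the ranking.
toy = pd.DataFrame({
    'Life expectancy': [50, 60, 70, 80],
    'Schooling': [5, 8, 11, 14],
    'HIV/AIDS': [9.0, 4.0, 2.0, 0.5],
    'noise': [1, 3, 1, 3],
})
ranked = rank_correlations(toy, 'Life expectancy')
print(ranked)
```

Sorting by absolute value keeps strong negative predictors near the top while preserving the sign of each correlation.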

### Task 10: Data Splitting and Model Training

In [None]:
features = ['GDP', 'Schooling', 'Alcohol', 'BMI', 'HIV/AIDS']
X = df[features]
y = df['Life expectancy']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
pred = model.predict(X_test)
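
Task 3 fills missing values with means computed on the full dataset before the split, so test-set information leaks into the imputation. One possible alternative, sketched here on synthetic data, is to wrap imputation and regression in a scikit-learn pipeline so the means are learned from the training rows only. The arrays `X_toy` and `y_toy` are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic features and target; about 10% of feature values are then set to NaN.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 3))
y_toy = 70 + X_toy @ np.array([3.0, -2.0, 1.0]) + rng.normal(0, 0.5, 200)
X_toy[rng.random(X_toy.shape) < 0.1] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# The imputer learns column means from X_tr only, then applies them to X_te.
pipe = make_pipeline(SimpleImputer(strategy='mean'), LinearRegression())
pipe.fit(X_tr, y_tr)
toy_pred = pipe.predict(X_te)
print(pipe.score(X_te, y_te))  # R² on the held-out rows
```

Applying this to the real data would mean skipping the global `fillna` for the model features and passing the raw columns to the pipeline instead.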

### Task 11: Cross Validation and Model Evaluation

In [None]:
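# Hold-out metrics on the 20% test split
mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)
# Mean R² across 5 folds (R² is the default score for regressors)
cross_val = cross_val_score(model, X, y, cv=5, scoring='r2').mean()
mae, r2, cross_val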
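
Because each country contributes several yearly rows, a plain k-fold split can place the same country in both the training and validation folds, which may make the score optimistic. The sketch below, on synthetic data with hypothetical country labels, shows one way to get a more conservative estimate: `GroupKFold` keeps every country entirely within a single fold, and the score is reported as MAE in the target's units.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, cross_val_score

# Synthetic panel: 10 made-up countries with 5 yearly rows each.
rng = np.random.default_rng(0)
n_countries, n_years = 10, 5
toy = pd.DataFrame({
    'Country': np.repeat([f'C{i}' for i in range(n_countries)], n_years),
    'Schooling': rng.uniform(4, 20, n_countries * n_years),
})
toy['Life expectancy'] = 45 + 2 * toy['Schooling'] + rng.normal(0, 1, len(toy))

X_toy = toy[['Schooling']]
y_toy = toy['Life expectancy']

# All rows of a country land in the same fold.
cv = GroupKFold(n_splits=5)
mae_scores = -cross_val_score(
    LinearRegression(), X_toy, y_toy,
    groups=toy['Country'], cv=cv, scoring='neg_mean_absolute_error',
)
print(mae_scores.mean())
```

On the real data, `groups=df['Country']` would play the same role. Comparing the grouped score against the plain 5-fold score indicates how much the model relies on having seen a country before.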

### Task 12: Conclusion and Recommendations
- **Key Findings**: Life expectancy is positively associated with GDP, schooling, and healthcare access. HIV/AIDS is a major negative predictor.
- **Model Performance**: The linear model gives reasonable accuracy, as measured by the test-set MAE and R² and by the mean 5-fold cross-validation score from Task 11.
- **Recommendation**: Focus on improving education, economic stability, and healthcare, the factors most strongly associated with life expectancy in this analysis.