### Project Overview: Analyzing Climate Change in Africa

In this project, we'll be working with the **'Climate Change in Africa' dataset**, provided by the U.S. Global Change Research Program. This dataset contains valuable historical data on daily minimum, maximum, and average temperature fluctuations across five African countries: **Egypt, Tunisia, Cameroon, Senegal,** and **Angola**, spanning from **1980 to 2023**.

📊 **Dataset Description**: The data offers insights into temperature trends and patterns across the selected countries, presenting an opportunity to explore and visualize climate variations over the years.

➡️ [**Dataset Link**](https://drive.google.com/file/d/1I8eV4-8p61CNNlVJzzho2xeoZ5-P7Q0F/view)

---

### Instructions

1. **Load the Dataset**  
   Begin by importing the dataset into a DataFrame using Python.

2. **Data Cleaning**  
   Perform necessary data cleaning to ensure accuracy and consistency in your analysis.

3. **Line Chart Visualization**  
   Create a line chart to display the average temperature fluctuations in **Tunisia** and **Cameroon**. Analyze and interpret the observed trends.

4. **Time Frame Focus (1980-2005)**  
   Zoom in on the data between **1980 and 2005**, and customize the axes labels for better clarity.

5. **Histograms of Temperature Distribution**  
   Generate histograms showing the temperature distribution in **Senegal**, comparing the periods **1980-2000** and **2000-2023** within the same figure. Summarize the key insights.

6. **Country-Wise Temperature Visualization**  
   Choose the most appropriate chart type to represent the **average temperature per country**.

7. **Exploratory Analysis**  
   Formulate your own questions about the dataset and explore answers using relevant visuals.


In [None]:
import warnings
warnings.filterwarnings("ignore")

### Importing necessary libraries

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

### Load the dataset

In [None]:
# Load the dataset into a DataFrame
df = pd.read_csv('Africa_climate_change.csv')

# Display the first few rows of the dataset to confirm it has loaded correctly
df.head()

### EDA and Cleaning

In [None]:
# Brief description of the dataset
df.info()

In [None]:
# Summary statistics for all columns in the DataFrame
df.describe(include = 'all').T

In [None]:
# Display 30 random rows
df.sample(n = 30)

In [None]:
# Convert DATE column to datetime format
df['DATE'] = pd.to_datetime(df['DATE'], format='%Y%m%d %H%M%S', errors = 'coerce')

In [None]:
df.head()
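
Because `errors='coerce'` silently turns every value that does not match the given format into `NaT`, it can be worth counting how many dates failed to parse before moving on. The following is a minimal sketch on made-up values; the `sample` frame is hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample: two values match the format, one does not
sample = pd.DataFrame({'DATE': ['19800101 000000', '19800102 000000', 'not a date']})

# Same call as in the notebook: unparseable values become NaT instead of raising
parsed = pd.to_datetime(sample['DATE'], format='%Y%m%d %H%M%S', errors='coerce')

# Count the rows that could not be parsed
n_failed = parsed.isna().sum()
print(f"Unparseable dates: {n_failed} of {len(sample)}")
```

Running the same `isna().sum()` check on `df['DATE']` would show whether the chosen format string actually fits the file.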

### Handling the missing values

In [None]:
df['COUNTRY'].unique()

In [None]:
# Group by country and calculate summary statistics for TAVG
country_tavg_stats = df.groupby('COUNTRY')['TAVG'].agg(['mean', 'std'])

# Display the summary statistics
country_tavg_stats

In [None]:
# Group by country and calculate summary statistics for TMAX
country_tmax_stats = df.groupby('COUNTRY')['TMAX'].agg(['mean', 'std'])

# Display the summary statistics
country_tmax_stats

In [None]:
# Group by country and calculate summary statistics for TMIN
country_tmin_stats = df.groupby('COUNTRY')['TMIN'].agg(['mean', 'std'])

# Display the summary statistics
country_tmin_stats

- Because temperatures can vary significantly by region, we group the dataset by country and fill the missing values in each temperature column with that country's mean.

In [None]:
# Fill missing temperature values with the mean of each country
df['TAVG'] = df.groupby('COUNTRY')['TAVG'].transform(lambda x: x.fillna(x.mean()))
df['TMAX'] = df.groupby('COUNTRY')['TMAX'].transform(lambda x: x.fillna(x.mean()))
df['TMIN'] = df.groupby('COUNTRY')['TMIN'].transform(lambda x: x.fillna(x.mean()))

# Check if the missing values are filled
print(df.isnull().sum())

##### Before replacing the missing values in PRCP, let's check whether it is related to the other columns

In [None]:
# To check the correlation between temperature and precipitation

correlation_tavg_prcp = df[['TAVG', 'PRCP']].corr().iloc[0, 1]
print(f"Correlation between TAVG and PRCP: {correlation_tavg_prcp}")

correlation_tmax_prcp = df[['TMAX', 'PRCP']].corr().iloc[0, 1]
print(f"Correlation between TMAX and PRCP: {correlation_tmax_prcp}")

correlation_tmin_prcp = df[['TMIN', 'PRCP']].corr().iloc[0, 1]
print(f"Correlation between TMIN and PRCP: {correlation_tmin_prcp}")

The results show only a very weak correlation, so let's check within each country.

##### Relationship with countries

In [None]:
# Group by country and calculate correlation between PRCP and TAVG
country_prcp_tavg_corrs = df.groupby('COUNTRY').apply(lambda x: x['PRCP'].corr(x['TAVG'])).reset_index(name='PRCP_TAVG_Corr')

# Group by country and calculate correlation between PRCP and TMIN
country_prcp_tmin_corrs = df.groupby('COUNTRY').apply(lambda x: x['PRCP'].corr(x['TMIN'])).reset_index(name='PRCP_TMIN_Corr')

# Group by country and calculate correlation between PRCP and TMAX
country_prcp_tmax_corrs = df.groupby('COUNTRY').apply(lambda x: x['PRCP'].corr(x['TMAX'])).reset_index(name='PRCP_TMAX_Corr')

# Merge the correlation results into a single DataFrame
country_corrs = pd.merge(country_prcp_tavg_corrs, country_prcp_tmin_corrs, on='COUNTRY')
country_corrs = pd.merge(country_corrs, country_prcp_tmax_corrs, on='COUNTRY')


country_corrs

##### Still no correlation

In [None]:
# Group by country and calculate summary statistics for PRCP
country_prcp_stats = df.groupby('COUNTRY')['PRCP'].agg(['mean', 'median', 'min', 'max', 'std', 'count'])

# Display the summary statistics
country_prcp_stats

We see that PRCP values vary across the countries.

##### Replace missing values with the mean precipitation for each country

In [None]:
# Calculate the mean precipitation for each country
mean_prcp_by_country = df.groupby('COUNTRY')['PRCP'].mean()

# Define a function to fill missing values with the country-specific mean
def fill_missing_prcp(row):
    if pd.isna(row['PRCP']):
        return mean_prcp_by_country[row['COUNTRY']]
    else:
        return row['PRCP']

# Apply the function to fill missing values
df['PRCP'] = df.apply(fill_missing_prcp, axis=1)

# Verify the changes
df.head()
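
The row-wise `apply` above works, but it calls a Python function once per row. As a hedged alternative, the same per-country mean fill can be written in vectorized form with `groupby(...).transform('mean')` followed by `fillna`. The sketch below uses a small made-up frame (`toy`, hypothetical values) rather than the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing PRCP value per country
toy = pd.DataFrame({
    'COUNTRY': ['Egypt', 'Egypt', 'Egypt', 'Senegal', 'Senegal'],
    'PRCP': [0.0, np.nan, 2.0, 4.0, np.nan],
})

# Per-row series holding the mean PRCP of that row's country (NaN values are skipped)
country_mean = toy.groupby('COUNTRY')['PRCP'].transform('mean')

# Fill only the missing entries with the matching country mean
toy['PRCP'] = toy['PRCP'].fillna(country_mean)
print(toy)
```

On the real DataFrame, the equivalent would be `df['PRCP'].fillna(df.groupby('COUNTRY')['PRCP'].transform('mean'))`, assuming the column names used above.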

## Visualizations

### Create a line chart to display the average temperature fluctuations in Tunisia and Cameroon.

In [None]:
# Extract year, month, and day
df['Year'] = df['DATE'].dt.year
#df['Month'] = df['DATE'].dt.month
# Format the month names as "Jan", "Feb", etc.
df['Month'] = df['DATE'].dt.strftime('%b')
df['Day'] = df['DATE'].dt.day

df.head()

In [None]:
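# Filter the data for Tunisia and Cameroon
# (.copy() gives an independent DataFrame, so later column assignments don't trigger SettingWithCopyWarning)
df_filtered = df[df['COUNTRY'].isin(['Tunisia', 'Cameroon'])].copy()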

# Group by 'Year' and 'COUNTRY', then calculate the average temperature
df_yearly = df_filtered.groupby(['Year', 'COUNTRY'])['TAVG'].mean().reset_index()

# Create the line chart
fig = px.line(df_yearly, x='Year', y='TAVG', color='COUNTRY',
              title='Average Yearly Temperature Fluctuations in Tunisia and Cameroon',
              labels={'Year': 'Year', 'TAVG': 'Average Temperature (°C)'})

fig.update_layout(legend_title_text='Country')
fig.show()

#### By month

In [None]:
# Define the correct order for months
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Convert 'Month' to a categorical type with the specified order
df_filtered['Month'] = pd.Categorical(df_filtered['Month'], categories=month_order, ordered=True)

# Group by 'Month' and 'COUNTRY', then calculate the average temperature
df_monthly = df_filtered.groupby(['Month', 'COUNTRY'])['TAVG'].mean().reset_index()

# Create the line chart
fig = px.line(df_monthly, x='Month', y='TAVG', color='COUNTRY',
              title='Average Monthly Temperature Fluctuations in Tunisia and Cameroon',
              labels={'Month': 'Month', 'TAVG': 'Average Temperature (°C)'})

fig.update_layout(legend_title_text='Country')
fig.show()

### Zoom in on the data between 1980 and 2005, and customize the axes labels for better clarity.

In [None]:
# Filter the data between 1980 and 2005
df_filtered_2 = df_filtered[(df_filtered['Year'] >= 1980) & (df_filtered['Year'] <= 2005)]

# Group by 'Year' and 'COUNTRY', then calculate the average temperature
df_yearly = df_filtered_2.groupby(['Year', 'COUNTRY'])['TAVG'].mean().reset_index()

# Create the line chart
fig = px.line(df_yearly, x='Year', y='TAVG', color='COUNTRY',
              title='Average Yearly Temperature Fluctuations in Tunisia and Cameroon (1980-2005)',
              labels={'Year': 'Year', 'TAVG': 'Average Temperature (°C)'})

# Customize x-axis and y-axis labels, with an x-axis tick every 5 years
fig.update_layout(
    xaxis_title='Year',
    yaxis_title='Average Temperature (°C)',
    xaxis=dict(
        tickmode='linear',
        tick0=1980,
        dtick=5
    )
)

# Show the figure
fig.show()

### Temperature Distribution in Senegal (1980-2000 vs 2000-2023)

In [None]:
# Filter the data for Senegal (.copy() avoids SettingWithCopyWarning on the assignment below)
senegal_df = df[df['COUNTRY'] == 'Senegal'].copy()

# Extract the year from 'DATE'
senegal_df['Year'] = senegal_df['DATE'].dt.year

# Split the data into two non-overlapping periods:
# 1980-2000 (inclusive) and 2001-2023, labelled '2000-2023' in the chart
senegal_1980_2000 = senegal_df[(senegal_df['Year'] >= 1980) & (senegal_df['Year'] <= 2000)]
senegal_2000_2023 = senegal_df[(senegal_df['Year'] > 2000) & (senegal_df['Year'] <= 2023)]


# Create histograms
hist_1980_2000 = go.Histogram(
    x=senegal_1980_2000['TAVG'],
    opacity=0.6,
    name='1980-2000',
    marker=dict(color='blue')
)

hist_2000_2023 = go.Histogram(
    x=senegal_2000_2023['TAVG'],
    opacity=0.6,
    name='2000-2023',
    marker=dict(color='red')
)

# Combine histograms in one figure
fig = go.Figure(data=[hist_1980_2000, hist_2000_2023])

# Update layout for better visibility
fig.update_layout(
    barmode='overlay',
    title='Temperature Distribution in Senegal (1980-2000 vs 2000-2023)',
    xaxis_title='Average Temperature (°C)',
    yaxis_title='Frequency',
    legend_title_text='Period',
    legend=dict(
        x=0.05, y=0.95,
        bgcolor='rgba(255, 255, 255, 0.5)'
    )
)

# Show the figure
fig.show()
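
To put numbers behind the comparison of the two Senegal periods, one option is to label each row with its period and compare simple summary statistics. This is only a sketch on made-up values; `toy_senegal` and its temperatures are hypothetical and say nothing about the real distribution.

```python
import numpy as np
import pandas as pd

# Hypothetical yearly values standing in for the Senegal subset
toy_senegal = pd.DataFrame({
    'Year': [1985, 1995, 2000, 2001, 2010, 2020],
    'TAVG': [27.0, 28.0, 29.0, 28.0, 29.0, 30.0],
})

# Label each row with its period, using the same split as the histograms (2000 falls in the first period)
toy_senegal['Period'] = np.where(toy_senegal['Year'] <= 2000, '1980-2000', '2001-2023')

# Compare central tendency and spread between the two periods
summary = toy_senegal.groupby('Period')['TAVG'].agg(['mean', 'median', 'std'])
print(summary)
```

A shift in the mean or median between the two rows would correspond to the histograms being offset from each other.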

From the Tunisia and Cameroon line chart above, we can see that Tunisia has a lower average temperature than Cameroon:
- The peak for Cameroon was in 1991
- The peak for Tunisia was in 1999
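
Rather than reading peak years off the chart, they can also be looked up directly with `idxmax` on the yearly averages. The sketch below uses a made-up frame shaped like `df_yearly`; the values in `toy_yearly` are hypothetical and are not meant to reproduce the peaks listed above.

```python
import pandas as pd

# Hypothetical yearly averages in the same layout as df_yearly
toy_yearly = pd.DataFrame({
    'Year': [1990, 1991, 1992, 1990, 1991, 1992],
    'COUNTRY': ['Cameroon'] * 3 + ['Tunisia'] * 3,
    'TAVG': [25.0, 26.5, 25.5, 18.0, 18.2, 19.1],
})

# For each country, find the index label of the row with the highest TAVG, then select those rows
peak_rows = toy_yearly.loc[toy_yearly.groupby('COUNTRY')['TAVG'].idxmax()]
print(peak_rows[['COUNTRY', 'Year', 'TAVG']])
```

Applying the same two lines to `df_yearly` would return one row per country with its warmest year in the plotted range.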

### Country-Wise Temperature Visualization

In [None]:
# Group by country and calculate the average temperature
country_avg_temp = df.groupby('COUNTRY')['TAVG'].mean().reset_index()

# Create a bar chart
fig = px.bar(country_avg_temp, x='COUNTRY', y='TAVG', 
             title='Average Temperature per Country',
             labels={'TAVG': 'Average Temperature (°C)', 'COUNTRY': 'Country'})

# Customize the layout for better clarity
fig.update_layout(xaxis_title='Country', yaxis_title='Average Temperature (°C)', 
                  xaxis_tickangle=-45, 
                  title_font_size=20)

# Show the figure
fig.show()