# DATA VISUALIZATION WITH PYTHON

**In this checkpoint, we are going to work on the 'Climate change in Africa' dataset provided by the U.S. Global Change Research Program.**

*Dataset description: This dataset contains historical daily minimum, maximum, and average temperatures for 5 African countries (Egypt, Tunisia, Cameroon, Senegal, Angola) between 1980 and 2023.*

➡️ Dataset link

https://i.imgur.com/w2czdso.jpg


**Instructions**

1. Load the dataset into a data frame using Python.
2. Clean the data as needed.
3. Plot a line chart to show the average temperature fluctuations in Tunisia and Cameroon. Interpret the results.
4. Zoom in to include only data between 1980 and 2005, and customize the axis labels.
5. Create histograms to show the temperature distribution in Senegal between [1980,2000] and [2000,2023] (in the same figure). Describe the obtained results.
6. Select the best chart to show the average temperature per country.
7. Make your own questions about the dataset and try to answer them using the appropriate visuals.

### Importing necessary libraries

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

### Loading the dataset

In [2]:
df = pd.read_csv('Africa_climate_change.csv')

### Overview of the Dataset

In [3]:
df.head()

Unnamed: 0,DATE,PRCP,TAVG,TMAX,TMIN,COUNTRY
0,19800101 000000,,54.0,61.0,43.0,Tunisia
1,19800101 000000,,49.0,55.0,41.0,Tunisia
2,19800101 000000,0.0,72.0,86.0,59.0,Cameroon
3,19800101 000000,,50.0,55.0,43.0,Tunisia
4,19800101 000000,,75.0,91.0,,Cameroon


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 464815 entries, 0 to 464814
Data columns (total 6 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   DATE     464815 non-null  object 
 1   PRCP     177575 non-null  float64
 2   TAVG     458439 non-null  float64
 3   TMAX     363901 non-null  float64
 4   TMIN     332757 non-null  float64
 5   COUNTRY  464815 non-null  object 
dtypes: float64(4), object(2)
memory usage: 21.3+ MB


### Data cleaning and exploration

In [5]:
# Convert the DATE column to datetime
df['DATE'] = pd.to_datetime(df['DATE'], format='%Y%m%d %H%M%S')

In [6]:
df.head()

Unnamed: 0,DATE,PRCP,TAVG,TMAX,TMIN,COUNTRY
0,1980-01-01,,54.0,61.0,43.0,Tunisia
1,1980-01-01,,49.0,55.0,41.0,Tunisia
2,1980-01-01,0.0,72.0,86.0,59.0,Cameroon
3,1980-01-01,,50.0,55.0,43.0,Tunisia
4,1980-01-01,,75.0,91.0,,Cameroon


In [7]:
df['COUNTRY'].unique()

array(['Tunisia', 'Cameroon', 'Senegal', 'Egypt', 'Angola'], dtype=object)

In [8]:
def replace_missing_values(df, country_column, country_name, column_WMV):
    
    mean_val = df[df[country_column] == country_name][column_WMV].mean()
    mean_val = round(mean_val, 2)
    
    # Fill missing values in column_WMV for the given country with that country's mean
    df.loc[df[country_column] == country_name, column_WMV] = df[df[country_column] == country_name][column_WMV].fillna(mean_val)

In [9]:
# Replace missing values with the mean for Tunisia in each column with missing values
replace_missing_values(df, 'COUNTRY', 'Tunisia', 'TAVG')
replace_missing_values(df, 'COUNTRY', 'Tunisia', 'TMAX')
replace_missing_values(df, 'COUNTRY', 'Tunisia', 'TMIN')

In [10]:
# Replace missing values with the mean for Cameroon in each column with missing values
replace_missing_values(df, 'COUNTRY', 'Cameroon', 'TAVG')
replace_missing_values(df, 'COUNTRY', 'Cameroon', 'TMAX')
replace_missing_values(df, 'COUNTRY', 'Cameroon', 'TMIN')

In [11]:
# Replace missing values with the mean for Senegal in each column with missing values
replace_missing_values(df, 'COUNTRY', 'Senegal', 'TAVG')
replace_missing_values(df, 'COUNTRY', 'Senegal', 'TMAX')
replace_missing_values(df, 'COUNTRY', 'Senegal', 'TMIN')

In [12]:
# Replace missing values with the mean for Egypt in each column with missing values
replace_missing_values(df, 'COUNTRY', 'Egypt', 'TAVG')
replace_missing_values(df, 'COUNTRY', 'Egypt', 'TMAX')
replace_missing_values(df, 'COUNTRY', 'Egypt', 'TMIN')

In [13]:
# Replace missing values with the mean for Angola in each column with missing values
replace_missing_values(df, 'COUNTRY', 'Angola', 'TAVG')
replace_missing_values(df, 'COUNTRY', 'Angola', 'TMAX')
replace_missing_values(df, 'COUNTRY', 'Angola', 'TMIN')
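
The per-country calls above could also be written as a single grouped operation. The following is a minimal sketch, not the notebook's own method: the helper name `fill_with_country_mean` and the toy data are hypothetical, and it assumes the same rule as `replace_missing_values` (per-country mean, rounded to 2 decimals).

```python
import pandas as pd


def fill_with_country_mean(frame, columns, group_col="COUNTRY"):
    """Return a copy where NaNs in `columns` are replaced by the per-group mean.

    Hypothetical helper: same idea as replace_missing_values, but it
    handles every country and every listed column in one pass.
    """
    out = frame.copy()
    for col in columns:
        # transform("mean") broadcasts each group's mean back to its rows
        group_mean = out.groupby(group_col)[col].transform("mean").round(2)
        out[col] = out[col].fillna(group_mean)
    return out


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "COUNTRY": ["Tunisia", "Tunisia", "Cameroon", "Cameroon"],
    "TAVG": [50.0, None, 70.0, 80.0],
    "TMIN": [40.0, 44.0, None, 60.0],
})
filled = fill_with_country_mean(toy, ["TAVG", "TMIN"])
print(filled)
```

On the notebook's data, the equivalent call would be `df = fill_with_country_mean(df, ['TAVG', 'TMAX', 'TMIN'])`. Unlike `replace_missing_values`, this sketch returns a new frame instead of modifying `df` in place.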

In [14]:
df.isnull().sum()

DATE            0
PRCP       287240
TAVG            0
TMAX            0
TMIN            0
COUNTRY         0
dtype: int64

In [15]:
df.head()

Unnamed: 0,DATE,PRCP,TAVG,TMAX,TMIN,COUNTRY
0,1980-01-01,,54.0,61.0,43.0,Tunisia
1,1980-01-01,,49.0,55.0,41.0,Tunisia
2,1980-01-01,0.0,72.0,86.0,59.0,Cameroon
3,1980-01-01,,50.0,55.0,43.0,Tunisia
4,1980-01-01,,75.0,91.0,69.27,Cameroon


In [16]:
# Assumption: a missing precipitation reading is treated as no precipitation
df['PRCP'] = df['PRCP'].fillna(0)

### Visualization

In [17]:
# Extract year, month, and day
df['Year'] = df['DATE'].dt.year
df['Month'] = df['DATE'].dt.month
df['Day'] = df['DATE'].dt.day

In [18]:
df = df.sort_values(by='DATE')

In [19]:
df.head()

Unnamed: 0,DATE,PRCP,TAVG,TMAX,TMIN,COUNTRY,Year,Month,Day
0,1980-01-01,0.0,54.0,61.0,43.0,Tunisia,1980,1,1
24,1980-01-01,0.0,82.96,94.0,62.0,Senegal,1980,1,1
23,1980-01-01,0.0,49.0,52.0,57.17,Tunisia,1980,1,1
22,1980-01-01,0.0,84.0,91.0,69.27,Cameroon,1980,1,1
21,1980-01-01,0.0,80.0,97.0,68.0,Senegal,1980,1,1


In [30]:
# Line chart of the average temperature fluctuations in Tunisia and Cameroon

# Filter the data for Tunisia and Cameroon
tunisia_df = df[df['COUNTRY'] == 'Tunisia']
cameroon_df = df[df['COUNTRY'] == 'Cameroon']

# Calculate the mean TAVG for each year
tunisia_avg = tunisia_df.groupby('Year')['TAVG'].mean().reset_index()
cameroon_avg = cameroon_df.groupby('Year')['TAVG'].mean().reset_index()

# Add country information to the dataframes
tunisia_avg['COUNTRY'] = 'Tunisia'
cameroon_avg['COUNTRY'] = 'Cameroon'

# Combine the dataframes
combined_df = pd.concat([tunisia_avg, cameroon_avg])

# Plot the data using Plotly
fig = px.line(combined_df, x='Year', y='TAVG', color='COUNTRY', title='Average Temperature Fluctuations in Tunisia and Cameroon')

# Show the plot
fig.show()





From the line chart above, we can see that Tunisia has a lower average temperature than Cameroon:
- The peak for Cameroon was in 1991.
- The peak for Tunisia was in 1999.
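
Peak years like these can be read off the yearly averages programmatically rather than by eye. The sketch below uses invented toy values (not the dataset's) and assumes a frame shaped like `combined_df`, with `Year`, `COUNTRY`, and `TAVG` columns.

```python
import pandas as pd

# Toy yearly averages, invented for illustration only
yearly = pd.DataFrame({
    "Year": [1990, 1991, 1992, 1990, 1991, 1992],
    "COUNTRY": ["Tunisia"] * 3 + ["Cameroon"] * 3,
    "TAVG": [64.0, 63.5, 65.2, 77.1, 78.4, 76.9],
})

# A concatenated frame can carry duplicate index labels, which would make
# the idxmax lookup ambiguous, so start from a clean 0..n-1 index.
yearly = yearly.reset_index(drop=True)

# idxmax returns, per country, the row label of the highest TAVG
peak_rows = yearly.loc[yearly.groupby("COUNTRY")["TAVG"].idxmax()]
peaks = peak_rows.set_index("COUNTRY")["Year"].to_dict()
print(peaks)
```

To apply this to the notebook, replace `yearly` with `combined_df`. The `reset_index(drop=True)` step matters there because `pd.concat` keeps each input frame's original index.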

In [26]:
# Sort the DataFrame by the DATE column
df_sorted = df.sort_values(by='DATE')

In [27]:
# Keep only rows dated between 1980 and 2005 (the country filter is applied below)
filtered_df = df_sorted[(df_sorted['DATE'] >= '1980-01-01') & (df_sorted['DATE'] <= '2005-12-31')]

# Aggregate data by year and calculate the mean TAVG
tunisia_avg = filtered_df[filtered_df['COUNTRY'] == 'Tunisia'].groupby(filtered_df['DATE'].dt.year)['TAVG'].mean().reset_index()
cameroon_avg = filtered_df[filtered_df['COUNTRY'] == 'Cameroon'].groupby(filtered_df['DATE'].dt.year)['TAVG'].mean().reset_index()

# Add country information to the dataframes
tunisia_avg['COUNTRY'] = 'Tunisia'
cameroon_avg['COUNTRY'] = 'Cameroon'

# Rename the DATE column to YEAR for clarity
tunisia_avg.rename(columns={'DATE': 'YEAR'}, inplace=True)
cameroon_avg.rename(columns={'DATE': 'YEAR'}, inplace=True)

# Combine the dataframes
combined_df = pd.concat([tunisia_avg, cameroon_avg])

# Plot the data using Plotly
fig = px.line(combined_df, x='YEAR', y='TAVG', color='COUNTRY', title='Average Temperature Fluctuations in Tunisia and Cameroon (1980-2005)')

# Customize axis labels
fig.update_layout(
    xaxis_title='Year',
    yaxis_title='Average Temperature (TAVG)',
    xaxis=dict(
        tickmode='linear',
        dtick=1  # Ensure the x-axis shows each year
    )
)

# Show the plot
fig.show()






In [22]:
# Histograms of the temperature distribution in Senegal for 1980-2000 and 2000-2023 (same figure)

# Filter data for Senegal between 1980-2000 and 2000-2023
senegal_df_1980_2000 = df[(df['COUNTRY'] == 'Senegal') & (df['DATE'] >= '1980-01-01') & (df['DATE'] <= '2000-12-31')]
senegal_df_2000_2023 = df[(df['COUNTRY'] == 'Senegal') & (df['DATE'] >= '2000-01-01') & (df['DATE'] <= '2023-12-31')]

# Create histograms
hist_1980_2000 = go.Histogram(
    x=senegal_df_1980_2000['TAVG'],
    opacity=0.6,
    name='1980-2000',
    marker=dict(color='blue')
)

hist_2000_2023 = go.Histogram(
    x=senegal_df_2000_2023['TAVG'],
    opacity=0.6,
    name='2000-2023',
    marker=dict(color='red')
)

# Combine histograms in one figure
fig = go.Figure(data=[hist_1980_2000, hist_2000_2023])

# Update layout for better clarity
fig.update_layout(
    title='Temperature Distribution in Senegal (1980-2000 vs 2000-2023)',
    xaxis_title='Average Temperature (TAVG)',
    yaxis_title='Frequency',
    barmode='overlay',
    bargap=0.1
)

# Show the plot
fig.show()
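
To describe the two histograms in numbers, one option is to compare summary statistics for each period. The sketch below is only an illustration: the data is an invented toy, and the statistics chosen (count, mean, median, standard deviation) are an assumption about what is useful to report. As in the cell above, the year 2000 falls in both periods.

```python
import pandas as pd

# Toy daily readings, invented for illustration only
toy = pd.DataFrame({
    "DATE": pd.to_datetime(
        ["1985-06-01", "1995-06-01", "2000-06-01", "2010-06-01", "2020-06-01"]
    ),
    "TAVG": [80.0, 82.0, 83.0, 84.0, 86.0],
})

# Boolean masks per period; Series.between is inclusive on both ends
years = toy["DATE"].dt.year
periods = {
    "1980-2000": years.between(1980, 2000),
    "2000-2023": years.between(2000, 2023),
}

# One row per period, one column per statistic
summary = pd.DataFrame({
    name: toy.loc[mask, "TAVG"].agg(["count", "mean", "median", "std"])
    for name, mask in periods.items()
}).T
print(summary)
```

With the notebook's data, `toy` would be replaced by the Senegal rows of `df`. A higher mean or median in the second period would correspond to a histogram shifted to the right.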


In [25]:
# Bar chart of the mean TAVG per country: aggregate first so each bar is a true average
avg_by_country = df.groupby('COUNTRY', as_index=False)['TAVG'].mean()
fig = px.bar(avg_by_country, x='COUNTRY', y='TAVG', title='Average Temperature per Country',
             labels={'TAVG': 'Average Temperature (°F)', 'COUNTRY': 'Country'})


# Show the plot
fig.show()