# Malaria in Africa

In [None]:
# import libraries 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

In [None]:
# load the data
df = pd.read_csv("DatasetAfricaMalaria.csv")
df.head(3)

In [None]:
# get total number of countries under study
country_names_column = df['Country Name']
unique_country_names = country_names_column.unique()
print("Total Number of Countries: ", len(unique_country_names))

In [None]:
# Get a list of unique countries
print("Unique country names:")
country_names_column.unique()

In [None]:
column_names = df.columns.tolist()
column_names

In [None]:
# rename columns to shorter, compressed names
abbreviated_headers = {
  'Country Name' : 'countryname',
  'Year' : 'year',
  'Country Code' : 'countrycode',
  'Incidence of malaria (per 1,000 population at risk)' : 'iom(/1,000 pop)' ,
  'Malaria cases reported' : 'malariacasesreported',
  'Use of insecticide-treated bed nets (% of under-5 population)' : 'insecticidetreatednet(%juvenile)',
  'Children with fever receiving antimalarial drugs (% of children under age 5 with fever)' : 'treatment_antimalarialdrugs(%juvenile)',
  'Intermittent preventive treatment (IPT) of malaria in pregnancy (% of pregnant women)' : 'ipt_malaria(%pregnantwomen)',
  'People using safely managed drinking water services (% of population)' : 'safemanaged_drinkwater(%pop)',
  'People using safely managed drinking water services, rural (% of rural population)' : 'safemanaged_drinkwater_rural(%pop)',
  'People using safely managed drinking water services, urban (% of urban population)' : 'safemanaged_drinkwater_urban(%pop)',
  'People using safely managed sanitation services (% of population)' : 'safesanitationservices(%pop)',
  'People using safely managed sanitation services, rural (% of rural population)' : 'safesanitationservices_rural(%ruralpop)',
  'People using safely managed sanitation services, urban  (% of urban population)' : 'safesanitationservices_urban(%urbanpop)',
  'Rural population (% of total population)' : 'ruralpop(%totalpop)',
  'Rural population growth (annual %)' : 'ruralpopgrowth(annual%)',
  'Urban population (% of total population)' : 'urbanpop(%totalpop)',
  'Urban population growth (annual %)' : 'urbanpopgrowth(annual%)',
  'People using at least basic drinking water services (% of population)' : 'basicdrinkingwaterservices(%pop)',
  'People using at least basic drinking water services, rural (% of rural population)' : 'basicdrinkingwater_rural(%ruralpop)',
  'People using at least basic drinking water services, urban (% of urban population)' : 'basicdrinkingwater_urban(%urbanpop)',
  'People using at least basic sanitation services (% of population)' : 'bascisanitationservices(%pop)',
  'People using at least basic sanitation services, rural (% of rural population)' : 'bascisanitationservices_rural(%ruralpop)',
  'People using at least basic sanitation services, urban  (% of urban population)' :  'bascisanitationservices_urban(%urbanpop)',
  'latitude' : 'latitude',
  'longitude' : 'longitude',
  'geometry' : 'geometry'
}
df.rename(columns=abbreviated_headers, inplace=True)
df.head()

In [None]:
# check for null values
df.isnull().sum()

Many values are missing in the three intervention columns we are most concerned with:

1. insecticidetreatednet(%juvenile)            462

2. treatment_antimalarialdrugs(%juvenile)      472

3. ipt_malaria(%pregnantwomen)                 488

Similarly, we cannot draw conclusions about safely managed drinking water and safely managed sanitation, because these columns also have many null values:

1. safemanaged_drinkwater(%pop)                495

2. safemanaged_drinkwater_rural(%pop)          506

3. safemanaged_drinkwater_urban(%pop)          418

4. safesanitationservices(%pop)                462

5. safesanitationservices_rural(%ruralpop)     484

6. safesanitationservices_urban(%urbanpop)     462
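To put these counts in proportion, the short sketch below converts them to percentages. It assumes the 594 rows reported in the summary at the end of this notebook; the dictionary simply restates the counts listed above.

```python
# Null counts copied from the df.isnull().sum() output listed above
null_counts = {
    'insecticidetreatednet(%juvenile)': 462,
    'treatment_antimalarialdrugs(%juvenile)': 472,
    'ipt_malaria(%pregnantwomen)': 488,
    'safemanaged_drinkwater(%pop)': 495,
    'safemanaged_drinkwater_rural(%pop)': 506,
    'safemanaged_drinkwater_urban(%pop)': 418,
    'safesanitationservices(%pop)': 462,
    'safesanitationservices_rural(%ruralpop)': 484,
    'safesanitationservices_urban(%urbanpop)': 462,
}

TOTAL_ROWS = 594  # assumption: row count taken from the summary section

# Share of missing values per column, in percent
null_share = {col: 100 * n / TOTAL_ROWS for col, n in null_counts.items()}
for col, pct in null_share.items():
    print(f"{col:45s} {pct:5.1f}%")
```

Under that assumption, every one of these columns is more than 70% empty, which is the threshold the next cell uses to decide which columns to drop.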


In [None]:
# Calculate the percentage of null values in each column
null_percentage = (df.isnull().sum() / len(df)) * 100

# Get the list of columns where null percentage exceeds 70%
columns_to_drop = null_percentage[null_percentage > 70].index

# df1 keeps only the columns with at most 70% null values (columns above that threshold are dropped)
df1 = df.drop(columns = columns_to_drop)

In [None]:
# confirm remaining null values
df1.isnull().sum()

In [None]:
# Calculate the mean for numerical columns
numerical_columns = df1.select_dtypes(include=['float64', 'int64']).columns
numerical_means = df1[numerical_columns].mean()

# Fill null values in numerical columns with the mean of each column
df1[numerical_columns] = df1[numerical_columns].fillna(numerical_means)

In [None]:
# recheck null values
df1.isnull().sum()

**We have successfully cleaned the data.**
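Filling every gap with the global column mean pulls very different countries toward the same value. As an alternative (not what this notebook does), gaps could be filled with the mean of the same country. The sketch below compares the two approaches on a small hypothetical DataFrame; the country labels and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two countries with very different incidence levels
toy = pd.DataFrame({
    'countryname': ['A', 'A', 'A', 'B', 'B', 'B'],
    'iom(/1,000 pop)': [400.0, np.nan, 420.0, 10.0, 12.0, np.nan],
})

# Global mean imputation (the approach used in the cell above)
global_fill = toy['iom(/1,000 pop)'].fillna(toy['iom(/1,000 pop)'].mean())

# Alternative: fill each gap with the mean of the same country
country_means = toy.groupby('countryname')['iom(/1,000 pop)'].transform('mean')
country_fill = toy['iom(/1,000 pop)'].fillna(country_means)

print(pd.DataFrame({'global': global_fill, 'per_country': country_fill}))
```

In the toy data, the global mean assigns the same value to the missing rows of both countries, while the per-country mean keeps each country near its own level. A country with no observed values at all would still be left empty by the per-country approach.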

In [None]:
# Cleaned data
malaria = df1
# Summarize cleaned data
malaria.describe()

In [None]:
# shape of the data
malaria.shape

In [None]:
malaria.info()

# **Exploratory Data Analysis**

In [None]:
malaria.head()

In [None]:
# get the years covered by the data
malaria['year'].unique()

The data covers malaria incidence recorded from 2007 through 2017.

**Insights From the Data.**

1. What is the trend of malaria incidence over the years?
Identify regions and time periods with high and low malaria incidence rates.

In [None]:
incidence = malaria[['countryname', 'year', 'iom(/1,000 pop)']]

# Group by country and year, calculate the mean incidence of malaria
mean_malaria_incidence = incidence.groupby(['countryname', 'year']).mean().reset_index()

# Pivot the table for better visualization
pivot_table = mean_malaria_incidence.pivot(index='year',columns= 'countryname', values='iom(/1,000 pop)')

# Plotting
plt.figure(figsize=(8, 4))
sns.heatmap(pivot_table, cmap='YlGnBu')
plt.title('Incidence of Malaria (per 1,000 population at risk) by Country and Year')
plt.xlabel('Country Name')
plt.ylabel('Year')
plt.show()

The heatmap above does not label every country on the x-axis: at this figure size, only a subset of the country names fits without overlapping. Each cell in the heatmap corresponds to a specific country and year, with the color intensity indicating the malaria incidence rate.

Based on the heatmap, we can derive the following:

1. **Geographical patterns:** There are clear geographical patterns in malaria incidence. Countries in sub-Saharan Africa generally have higher incidence rates than the other African countries in the dataset.
2. **Temporal trends:** There are also noticeable temporal trends. Malaria incidence rates fluctuate over time within each country, with some years experiencing higher or lower incidence compared to others.
3. **Outliers:** A few countries stand out as having consistently high malaria incidence rates across multiple years.

**Interpretation:**
- The heatmap highlights the substantial burden of malaria in sub-Saharan Africa, particularly in countries such as Nigeria, Democratic Republic of the Congo, and Uganda.
- The temporal trends suggest that malaria incidence is not static and can vary significantly from year to year, likely influenced by factors such as climate, vector control measures, and access to healthcare.
- The outliers with consistently high incidence rates might represent countries with ongoing challenges in malaria control and prevention.

In [None]:
# Summary statistics of incidence of malaria
iom_stats = malaria.groupby('countryname')['iom(/1,000 pop)'].describe()

# Plotting the distribution of IOM by country
plt.figure(figsize=(12,12))
sns.boxplot(x='iom(/1,000 pop)', y='countryname', data=malaria, orient='h')
plt.xlabel('Incidence of Malaria (per 1,000 population)')
plt.ylabel('Country Name')
plt.title('Distribution of Incidence of Malaria by Country')
plt.show()

The boxplot showing the distribution of malaria incidence by country provides a visual representation of the spread and central tendency of malaria cases across different countries. Each boxplot represents a specific country, with the following elements:

- Median: The line inside the box (vertical here, because the boxes are drawn horizontally) represents the median malaria incidence rate, which divides the data into two halves.
- 25th and 75th percentiles: The edges of the box represent the 25th and 75th percentiles, indicating the range within which the middle 50% of the data falls.
- Whiskers: The lines extending from the box reach the most extreme values that lie within 1.5 times the interquartile range (IQR) of the box edges.
- Outliers: Data points beyond the whiskers are considered outliers and are plotted as individual points.

Based on the boxplot, we can observe the following:

- Variability: There is considerable variability in malaria incidence across countries. Some countries have a relatively narrow interquartile range (IQR), indicating a more consistent distribution of cases, while others have a wider IQR, suggesting a more dispersed distribution.
- Outliers: A few countries have outlier data points, representing years in which incidence was unusually far from that country's typical range.
- Median comparison: The median malaria incidence rate can be compared across countries to identify those with higher or lower burden of disease.

Interpretation:

- Countries with higher median incidence rates and wider IQRs might require more intensive malaria control measures and interventions.
- The outlier countries with extremely high incidence rates warrant further investigation to understand the underlying factors contributing to their elevated malaria burden.
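The quantities a boxplot draws can also be computed directly, which makes it easier to rank countries than reading them off the figure. The sketch below uses a small hypothetical DataFrame in place of `malaria` and assumes the usual 1.5 × IQR rule for outliers.

```python
import pandas as pd

# Hypothetical toy data standing in for the `malaria` DataFrame
toy = pd.DataFrame({
    'countryname': ['A'] * 5 + ['B'] * 5,
    'iom(/1,000 pop)': [10, 20, 30, 40, 200, 300, 310, 320, 330, 340],
})

grouped = toy.groupby('countryname')['iom(/1,000 pop)']
summary = pd.DataFrame({
    'median': grouped.median(),
    'q1': grouped.quantile(0.25),
    'q3': grouped.quantile(0.75),
})
summary['iqr'] = summary['q3'] - summary['q1']

# Values above q3 + 1.5 * IQR are drawn as individual outlier points
summary['upper_fence'] = summary['q3'] + 1.5 * summary['iqr']

# Rank countries by median incidence, highest first
print(summary.sort_values('median', ascending=False))
```

In the toy data, country A has an upper fence of 70, so its value of 200 would be plotted as an outlier even though country B's typical incidence is far higher. Outliers are relative to each country's own distribution.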

In [None]:
# Plotting the relationship between country name, year, and IOM using bar plots
plt.figure(figsize=(12, 8))
sns.barplot(x='countryname', y='iom(/1,000 pop)', hue='year', data=malaria, errorbar=None,dodge=True)
plt.xlabel('Country Name')
plt.ylabel('Average Incidence of Malaria (per 1,000 population)')
plt.title('Relationship between Country Name, Year, and Incidence of Malaria')
plt.legend(title='Year', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(rotation=90)  # Rotate x-axis labels for better readability
plt.tight_layout()
plt.show()

The barplot showing the relationship between country name, year, and incidence of malaria provides a visual representation of how malaria incidence varies across different countries and time periods. Each bar represents a specific country in a given year, with the height of the bar indicating the malaria incidence rate.

Based on the barplot, we can observe the following:

- Country comparison: The plot allows for easy comparison of malaria incidence between countries within the same year. For example, we can see that Burkina Faso consistently has higher incidence rates compared to Chad across all years.
- Outliers: Similar to the previous visualizations, there are a few countries with exceptionally high malaria incidence rates that stand out as outliers.

Interpretation:

- The barplot highlights the substantial burden of malaria in certain countries, particularly those with consistently high incidence rates across multiple years.
- The outliers with extremely high incidence rates warrant further investigation to understand the underlying causes and potential need for targeted interventions.

In [None]:
# Create a dictionary to store the top 3 countries for each year
top_3_dict = {}

# Group by 'year' and then apply the sorting and extraction of top 3 countries within each group
for year, group in malaria.groupby('year'):
    top_3_countries = group.nlargest(3, 'iom(/1,000 pop)')[['countryname', 'iom(/1,000 pop)']]
    top_3_dict[year] = top_3_countries['countryname'].values

# Convert the dictionary to a DataFrame
top_3_df = pd.DataFrame(top_3_dict)

# Rename the index for clarity
top_3_df.index = ['1st', '2nd', '3rd']

# Transpose the DataFrame so that years are columns
top_3_df = top_3_df.transpose()

# Display the results
print("Top 3 Countries with the Most Malaria Incidence Cases for Each Year:")
top_3_df

This shows the top 3 countries with the highest malaria incidence rate in each year.
These statistics give organisations and governments better information on which countries to investigate, so that they can take measures to mitigate the occurrence of malaria in those areas.
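The same ranking can be built without an explicit loop. The sketch below is one possible vectorised variant, shown on a small hypothetical DataFrame; `top3_table` is an illustrative name, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical toy data with the same column names as `malaria`
toy = pd.DataFrame({
    'countryname': ['A', 'B', 'C', 'D', 'A', 'B', 'C', 'D'],
    'year': [2007] * 4 + [2008] * 4,
    'iom(/1,000 pop)': [50, 300, 120, 10, 60, 280, 290, 5],
})

# Keep the three highest-incidence rows of each year
top3 = (
    toy.sort_values('iom(/1,000 pop)', ascending=False)
       .groupby('year')
       .head(3)
       .sort_values(['year', 'iom(/1,000 pop)'], ascending=[True, False])
)

# Rank within each year (1 = highest incidence)
top3['rank'] = top3.groupby('year').cumcount() + 1

# One row per year, one column per rank
top3_table = top3.pivot(index='year', columns='rank', values='countryname')
print(top3_table)
```

Because `top3` keeps the incidence values next to the country names, it is also easy to check how far apart the top three countries are in a given year.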

2. Relationship between incidence of malaria and basic water service

In [None]:
# Filter relevant columns
relevant_columns = ['year', 'iom(/1,000 pop)', 'basicdrinkingwaterservices(%pop)']
filtered_df = malaria[relevant_columns].dropna()  # Drop rows with missing values

# Group by year and calculate the mean values
mean_values = filtered_df.groupby('year').mean().reset_index()

# Plotting the relationship between basic drinking water services and malaria incidence over time
plt.figure(figsize=(6, 4))

# Line plot for basic drinking water services
sns.lineplot(x='year', y='basicdrinkingwaterservices(%pop)', data=mean_values, label='Basic Drinking Water Services')

# Line plot for malaria incidence
sns.lineplot(x='year', y='iom(/1,000 pop)', data=mean_values, color='red', label='Incidence of Malaria')

plt.xlabel('Year')
plt.ylabel('Percentage / Incidence')
plt.title('Relationship between Basic Drinking Water Services and Incidence of Malaria Over Time')
plt.legend()
plt.tight_layout()
plt.show()


**Inference -** On average, malaria incidence decreases over the period while access to basic drinking water services increases. Because both lines are yearly averages across all countries, the plot shows two trends that occur together rather than a causal link, and it cannot show how individual countries differ.
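To quantify how closely the two yearly averages move together, a Pearson correlation can be computed on a table shaped like `mean_values`. The numbers below are hypothetical and only illustrate the calculation.

```python
import pandas as pd

# Hypothetical yearly averages, shaped like `mean_values` above
toy_means = pd.DataFrame({
    'year': [2007, 2008, 2009, 2010, 2011],
    'iom(/1,000 pop)': [200.0, 192.0, 185.0, 170.0, 160.0],
    'basicdrinkingwaterservices(%pop)': [55.0, 57.0, 58.0, 61.0, 63.0],
})

# Pearson correlation between the two yearly series
r = toy_means['iom(/1,000 pop)'].corr(
    toy_means['basicdrinkingwaterservices(%pop)']
)
print(f"Pearson r between yearly means: {r:.2f}")
```

A strongly negative value here would only say that the two averages trend in opposite directions over time. Any two series that both change steadily with the year will correlate strongly, so this is not evidence that water access reduces malaria.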

In [None]:
# Filter relevant columns for analysis
water_data = malaria[['countryname', 'iom(/1,000 pop)', 'basicdrinkingwaterservices(%pop)', 'basicdrinkingwater_rural(%ruralpop)', 'basicdrinkingwater_urban(%urbanpop)']]

# Calculate the average incidence of malaria for each country
avg_malaria_incidence = water_data.groupby('countryname')['iom(/1,000 pop)'].mean().reset_index()

# Plotting the relationship between basic drinking water services and malaria incidence
plt.figure(figsize=(8, 4))

# Scatter plot for rural population
sns.scatterplot(x='basicdrinkingwater_rural(%ruralpop)', y='iom(/1,000 pop)', data=water_data, label='Rural Population', alpha=0.7)

# Scatter plot for urban population
sns.scatterplot(x='basicdrinkingwater_urban(%urbanpop)', y='iom(/1,000 pop)', data=water_data, label='Urban Population', alpha=0.7)

plt.xlabel('Percentage of Population with Basic Drinking Water Services')
plt.ylabel('Incidence of Malaria (per 1,000 population)')
plt.title('Impact of Basic Drinking Water Services on Malaria Incidence')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()


The scatter plot above shows the relationship between malaria incidence and the percentage of the population with basic drinking water services. Rural populations have markedly less access to basic drinking water services than urban populations, yet the range of malaria incidence is almost the same for both. In other words, both populations face a similar malaria burden despite unequal water access, so this plot alone does not show that water services reduce malaria incidence.

3. Relation between incidence of malaria and basic sanitation services

In [None]:
# Filter relevant columns
relevant_columns = ['year', 'iom(/1,000 pop)', 'bascisanitationservices(%pop)']
filtered_df = malaria[relevant_columns].dropna()  # Drop rows with missing values

# Group by year and calculate the mean values
mean_values = filtered_df.groupby('year').mean().reset_index()

# Plotting the relationship between basic sanitation services and malaria incidence over time
plt.figure(figsize=(6, 4))

# Line plot for basic sanitation services
sns.lineplot(x='year', y='bascisanitationservices(%pop)', data=mean_values, label='Basic Sanitation Services')

# Line plot for malaria incidence
sns.lineplot(x='year', y='iom(/1,000 pop)', data=mean_values, color='red', label='Incidence of Malaria')

plt.xlabel('Year')
plt.ylabel('Percentage / Incidence')
plt.title('Relationship between Basic Sanitation Services and Incidence of Malaria Over Time')
plt.legend()
plt.tight_layout()
plt.show()


**Inference** - The yearly averages show no clear relationship between malaria incidence and basic sanitation services.

In [None]:
# Filter relevant columns for analysis
sanitation_data = malaria[['countryname', 'iom(/1,000 pop)', 'bascisanitationservices(%pop)', 'bascisanitationservices_rural(%ruralpop)', 'bascisanitationservices_urban(%urbanpop)']]

# Calculate the average incidence of malaria for each country
avg_malaria_incidence = sanitation_data.groupby('countryname')['iom(/1,000 pop)'].mean().reset_index()

# Plotting the relationship between basic sanitation services and malaria incidence
plt.figure(figsize=(8, 4))

# Scatter plot for rural population
sns.scatterplot(x='bascisanitationservices_rural(%ruralpop)', y='iom(/1,000 pop)', data=sanitation_data, label='Rural Population', alpha=0.7)

# Scatter plot for urban population
sns.scatterplot(x='bascisanitationservices_urban(%urbanpop)', y='iom(/1,000 pop)', data=sanitation_data, label='Urban Population', alpha=0.7)

plt.xlabel('Percentage of Population with Basic Sanitation Services')
plt.ylabel('Incidence of Malaria (per 1,000 population)')
plt.title('Impact of Basic Sanitation Services on Malaria Incidence')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()


The scatter plot above shows that a smaller percentage of the rural population has basic sanitation services compared to the urban population. Overall, however, a share of the urban population also lacks basic sanitation services.

4. Relation between urban and rural population growth over the years

In [None]:
# Filter relevant columns
relevant_columns = ['year', 'ruralpopgrowth(annual%)', 'urbanpopgrowth(annual%)']
population_growth_df = malaria[relevant_columns].dropna()  # Drop rows with missing values

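# Plotting the relationship between urban and rural population growth over the years
fig, ax1 = plt.subplots(figsize=(8, 4))

# Line plot for rural population growth on the primary y-axis
sns.lineplot(x='year', y='ruralpopgrowth(annual%)', data=population_growth_df, ax=ax1, label='Rural Population Growth', legend=False)
ax1.set_xlabel('Year')
ax1.set_ylabel('Rural Population Growth Rate (%)')

# Create a secondary y-axis for urban population growth
ax2 = ax1.twinx()
sns.lineplot(x='year', y='urbanpopgrowth(annual%)', data=population_growth_df, ax=ax2, color='orange', label='Urban Population Growth', legend=False)
ax2.set_ylabel('Urban Population Growth Rate (%)')

# Combine the legend entries from both axes into a single legend
handles1, labels1 = ax1.get_legend_handles_labels()
handles2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(handles1 + handles2, labels1 + labels2, loc='upper left')

ax1.set_title('Relationship between Urban and Rural Population Growth over the Years')
fig.tight_layout()
plt.show()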


The plot highlights the ongoing trend towards urbanization in the countries studied. This has important implications for various aspects of society, such as housing, transportation, infrastructure, and service provision.

In [None]:
# Filter relevant columns for further analysis
relevant_columns = ['year', 'iom(/1,000 pop)', 'urbanpopgrowth(annual%)','ruralpopgrowth(annual%)']
malaria_urban_data = malaria[relevant_columns].dropna()  # Drop rows with missing values

# Group by year and calculate the mean values
mean_values = malaria_urban_data.groupby('year').mean().reset_index()

# Plotting the relationship between incidence of malaria and urban population growth over time
plt.figure(figsize=(7, 4))

# Line plot for urban population growth
sns.lineplot(x='year', y='urbanpopgrowth(annual%)', data=mean_values, label='Urban Population Growth')

# Line plot for rural population growth
sns.lineplot(x='year', y='ruralpopgrowth(annual%)', data=mean_values, label='Rural Population Growth')

# Line plot for incidence of malaria
sns.lineplot(x='year', y='iom(/1,000 pop)', data=mean_values, color='red', label='Incidence of Malaria')

plt.xlabel('Year')
plt.ylabel('Rate')
plt.title('Relationship between Incidence of Malaria, Urban and\n Rural Population Growth Over Time')
plt.legend()
plt.tight_layout()
plt.show()


There is no clear relationship between population growth (in either rural or urban areas) and malaria incidence. Because the lines are yearly averages across all countries, differences between individual countries with high or low urbanization rates are not visible in this plot.

5. Malaria cases reported across the countries in Africa, and which countries most urgently need further investigation.


In [None]:
# Filter relevant columns for analysis
relevant_columns = ['year', 'iom(/1,000 pop)', 'malariacasesreported']
malaria_cases_data = malaria[relevant_columns].dropna()  # Drop rows with missing values

# Group by year and calculate the mean values
mean_values = malaria_cases_data.groupby('year').mean().reset_index()

# Plotting the relationship between incidence of malaria and reported cases over time
plt.figure(figsize=(6, 4))

# Line plot for reported cases of malaria
sns.lineplot(x='year', y='malariacasesreported', data=mean_values, label='Reported Cases of Malaria')

# Line plot for incidence of malaria
sns.lineplot(x='year', y='iom(/1,000 pop)', data=mean_values, color='red', label='Incidence of Malaria')

plt.xlabel('Year')
plt.ylabel('Value')
plt.title('Relationship between Incidence of Malaria and Reported Cases Over Time')
plt.legend()
plt.tight_layout()
plt.show()


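The line plot above shows that the average number of reported malaria cases increased over time. Whether that growth is truly exponential cannot be read from this plot alone. Because reported cases are raw counts and incidence is a rate per 1,000 population, the incidence line may appear nearly flat when both share one axis.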
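One way to check whether growth in reported cases is close to exponential is to look at year-over-year growth rates and at a straight-line fit to the logarithm of the counts. The sketch below uses hypothetical yearly means, not values from the dataset.

```python
import numpy as np

# Hypothetical yearly mean reported cases (not taken from the dataset)
years = np.array([2007, 2008, 2009, 2010, 2011, 2012])
cases = np.array([1.0e6, 1.3e6, 1.5e6, 1.9e6, 2.2e6, 2.6e6])

# Year-over-year growth rate: roughly constant under exponential growth
yoy_growth = cases[1:] / cases[:-1] - 1

# Fit log(cases) = slope * (year - first year) + intercept
slope, intercept = np.polyfit(years - years[0], np.log(cases), 1)

print("Year-over-year growth:", np.round(yoy_growth, 2))
print(f"Implied annual growth rate from log-linear fit: {np.exp(slope) - 1:.1%}")
```

If the year-over-year rates vary widely, or the log-scale points bend away from a straight line, "exponential" is probably too strong a description of the trend.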

In [None]:
# Group by country and calculate the total malaria cases reported for each country
total_malaria_cases = malaria.groupby('countryname')['malariacasesreported'].sum().reset_index()

# Plotting
plt.figure(figsize=(12, 8))
sns.barplot(x='countryname', y='malariacasesreported', data=total_malaria_cases)
plt.xlabel('Country Name')
plt.ylabel('Total Malaria Cases Reported')
plt.title('Total Malaria Cases Reported by Country')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.tight_layout()
plt.show()


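This shows the total number of reported malaria cases for each country. High totals can reflect a larger population or better reporting as well as a heavier malaria burden, so the chart alone does not show which countries are taking effective measures to reduce malaria cases.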

In [None]:
# Filter relevant columns for analysis
relevant_columns = ['countryname', 'year', 'iom(/1,000 pop)', 'malariacasesreported']
malaria_country_data = malaria[relevant_columns].dropna()  # Drop rows with missing values

# Group by country and year and calculate the mean values
mean_values = malaria_country_data.groupby(['countryname', 'year']).mean().reset_index()

# Plotting the relationship between incidence of malaria and reported cases over time for different countries
plt.figure(figsize=(12, 8))

# Bar plot for reported cases of malaria for each year
sns.barplot(x='countryname', y='malariacasesreported', hue='year', data=mean_values)

plt.xlabel('Country Name')
plt.ylabel('Reported Cases of Malaria')
plt.title('Reported Cases of Malaria Over Time for Different Countries')
plt.legend(title='Year', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(rotation=90, ha='right')  # Rotate x-axis labels for better readability
plt.tight_layout()
plt.show()


This shows how reported cases changed over time in each country. 'Congo. Dem. Rep' has the greatest number of malaria cases reported.

6. Using geographical Location to get some insights.

In [None]:
# Create a scatter mapbox plot
fig = px.scatter_mapbox(
    data_frame=malaria,
    lat='latitude',
    lon='longitude',
    color='iom(/1,000 pop)',
    size='iom(/1,000 pop)',  # Optional: You can adjust the marker size based on malaria incidence
    color_continuous_scale=px.colors.sequential.Reds,  # You can choose any color scale
    mapbox_style='open-street-map',
    hover_name='countryname',
    title='Malaria Incidence by Country'
)

# Update layout for better visualization
fig.update_layout(
    mapbox=dict(
        center=go.layout.mapbox.Center(lat=0, lon=0),
        zoom=2
    ),
    margin={"r":0,"t":0,"l":0,"b":0}
)

# Display the map
fig.show()


Relationship between longitude,latitude and malaria incidence

In [None]:
# Relationship between longitude, latitude and malaria incidence

# Select relevant columns
malaria_selected = malaria[["latitude", "longitude", "iom(/1,000 pop)"]]

# Calculate the correlation matrix
correlation_matrix = malaria_selected.corr()

# Display the correlation matrix
print("Correlation matrix:")
print(correlation_matrix)

# Create a heatmap of the correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation between Latitude, Longitude and Malaria Incidence")
plt.show()

- Latitude and Longitude: There is a very weak negative correlation between latitude and longitude, indicating that as latitude increases, longitude tends to decrease slightly.
- Latitude and Incidence of Malaria: There is a very weak positive correlation between latitude and malaria incidence, suggesting that as latitude increases, malaria incidence tends to increase slightly.
- Longitude and Incidence of Malaria: There is a very weak negative correlation between longitude and malaria incidence, implying that as longitude increases, malaria incidence tends to decrease slightly.

Overall, the correlation matrix suggests that there are no strong linear relationships between latitude, longitude, and malaria incidence in the provided dataset.
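Pearson correlation only measures linear association. As a complementary check (not part of the original analysis), a Spearman rank correlation can reveal a monotonic but non-linear relationship. The sketch below computes it as the Pearson correlation of the ranks, using hypothetical values.

```python
import pandas as pd

# Hypothetical toy data: incidence doubles with each step in latitude
toy = pd.DataFrame({
    'latitude': [-10.0, -5.0, 0.0, 5.0, 10.0, 15.0],
    'iom(/1,000 pop)': [1.0, 2.0, 4.0, 8.0, 16.0, 32.0],
})

# Linear (Pearson) correlation on the raw values
pearson = toy.corr(method='pearson').loc['latitude', 'iom(/1,000 pop)']

# Spearman correlation is the Pearson correlation of the ranks
spearman = toy.rank().corr(method='pearson').loc['latitude', 'iom(/1,000 pop)']

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

In the toy data the relationship is perfectly monotonic, so the rank correlation is higher than the linear one. If both measures stay close to zero on the real data, that would strengthen the conclusion that latitude and longitude carry little information about incidence on their own.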

In [None]:
# Plot the relationship between latitude and malaria incidence
plt.figure(figsize=(8, 6))
plt.scatter(malaria_selected["latitude"], malaria_selected["iom(/1,000 pop)"])
plt.title("Relationship between Latitude and Malaria Incidence")
plt.xlabel("Latitude")
plt.ylabel("Malaria Incidence (per 1,000 population)")
plt.show()

The scatter plot visualizes the relationship between latitude and malaria incidence per 1,000 population. Each data point on the plot represents a location with its corresponding latitude and malaria incidence.

The following can be observed:

1. Distribution of data points: The data points are spread out across the plot, indicating that there is no clear pattern or trend between latitude and malaria incidence.
2. No Clusters: There are no distinct clusters or groups of data points, suggesting that there is no strong association between latitude and malaria incidence.
3. Wide range of malaria incidence: Malaria incidence varies widely across different latitudes, ranging from low values (close to 0) to high values (above 200).

**Inference**
Overall, the scatter plot suggests that there is no strong linear correlation between latitude and malaria incidence in the provided dataset, so latitude alone cannot be used to predict malaria incidence accurately. A few data points show high malaria incidence at latitudes below 0; these could be outliers or regions with specific conditions that contribute to higher malaria transmission. The majority of data points are concentrated between latitudes 0 and 20. This concentration partly reflects where the countries in the dataset are located, and it could also indicate that malaria is more prevalent in certain climate zones or regions.

In [None]:
# Plot the relationship between longitude and malaria incidence
plt.figure(figsize=(8, 6))
plt.scatter(malaria_selected["longitude"], malaria_selected["iom(/1,000 pop)"])
plt.title("Relationship between Longitude and Malaria Incidence")
plt.xlabel("Longitude")
plt.ylabel("Malaria Incidence (per 1,000 population)")
plt.show()

The scatter plot visualizes the relationship between longitude and malaria incidence per 1,000 population. Each data point on the plot represents a location with its corresponding longitude and malaria incidence.

The following observations can be made:

- Distribution of data points: The data points are spread out across the plot, indicating that there is no clear pattern or trend between longitude and malaria incidence.
- No obvious clusters: There are no distinct clusters or groups of data points, suggesting that there is no strong association between longitude and malaria incidence.
- Wide range of malaria incidence: Malaria incidence varies widely across different longitudes, ranging from low values (close to 0) to high values (above 200).
- There seem to be a few data points with high malaria incidence at longitudes around 20 and 120. These could potentially be outliers or regions with specific conditions that contribute to higher malaria transmission.
- The majority of data points are concentrated between longitudes -20 and 40. This could indicate that malaria incidence is more prevalent in certain geographical regions.

Overall, the scatter plot suggests that there is no significant correlation between longitude and malaria incidence in the provided dataset. This means that longitude alone cannot be used to predict malaria incidence accurately. Other factors or variables might need to be considered for a more comprehensive analysis.

In [None]:
# Distribution of target variable
plt.figure(figsize=(8, 6))
sns.histplot(malaria['iom(/1,000 pop)'], kde=True)
plt.title('Distribution of Malaria Incidence')
plt.xlabel('Incidence (per 1,000 population)')
plt.ylabel('Frequency')
plt.show()


The distribution of malaria incidence describes how the incidence values are spread across the country-year records in the dataset. It provides insight into how common low and high incidence rates are.

Based on the histogram, we can observe the following:

- Skewed distribution: The histogram is skewed to the right, indicating that there are more data points with lower malaria incidence values compared to higher values. The majority of data points are concentrated in the lower range of malaria incidence (between 0 and 50 per 1,000 population).
- Outliers: There are a few data points with extremely high malaria incidence values (above 200 per 1,000 population). These could potentially be outliers or regions with specific conditions that contribute to higher malaria transmission.

Overall, the distribution suggests that malaria incidence is relatively low in most regions represented in the dataset. However, there are certain areas or populations that experience significantly higher malaria incidence.
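Right skew can also be checked numerically: in a right-skewed sample the mean sits above the median and the sample skewness is positive. The values below are hypothetical and only illustrate the check.

```python
import pandas as pd

# Hypothetical incidence values with a long right tail
toy_incidence = pd.Series(
    [5, 8, 10, 12, 15, 20, 25, 40, 120, 350],
    name='iom(/1,000 pop)',
)

mean_value = toy_incidence.mean()
median_value = toy_incidence.median()
skewness = toy_incidence.skew()  # sample skewness; > 0 means right-skewed

print(f"mean = {mean_value:.1f}, median = {median_value:.1f}, skew = {skewness:.2f}")
```

Note that mean imputation places every filled-in row at exactly the column mean, which can add an artificial spike to a histogram of imputed data.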

**SUMMARY OF THE EDA**

Data Description:

The raw dataset consists of 594 rows × 27 columns, which was reduced to 594 rows × 18 columns during cleaning.

Data Cleaning:

Missing values were handled by dropping columns with more than 70% null values and imputing the column mean for the remaining numerical gaps.
The column headers were also renamed for easier data handling.

Exploratory Data Analysis:
- The distribution of malaria incidence was visualized using a histogram and per-country boxplots.
- The relationships between columns were visualized using line plots, scatter plots, and a correlation matrix.
- The summary statistics of each numerical column were calculated.

**Key Findings:**

- Based on the heatmap, there are clear geographical patterns in malaria incidence. Countries in sub-Saharan Africa generally have higher incidence rates than the other African countries in the dataset. The heatmap highlights the substantial burden of malaria in sub-Saharan Africa, particularly in countries such as Nigeria, Democratic Republic of the Congo, and Uganda.
- Access to basic drinking water services might be one of several factors contributing to the reduction of malaria incidence over time.
- The variability in the relationship between drinking water services and malaria highlights the need for country-specific analyses to understand the local context and identify the most effective strategies for malaria control.
- Access to basic drinking water services alone cannot fully explain the variation in malaria incidence across countries.
- The yearly averages did not show a clear relationship between access to basic sanitation services and malaria incidence, so sanitation can at most be one of several contributing factors.
- The variability in the rate of urbanization suggests that different countries are at different stages of the urbanization process. This might be influenced by factors such as economic development, government policies, and cultural preferences.
- The plot suggests that the relationship between malaria incidence and urbanization is complex and likely influenced by a combination of factors beyond simply the proportion of urban or rural population.
- In many countries, there seems to be a general trend of increasing reported cases over time, while the malaria incidence rate remains relatively stable or even decreases.
- The increasing trend in reported cases over time might reflect improvements in surveillance and reporting systems, leading to better detection and documentation of malaria cases.
- The correlation matrix suggests that there are no strong linear relationships between latitude, longitude, and malaria incidence in the provided dataset.
- The insecticide-treated bed net column was dropped during cleaning because more than 70% of its values were missing, so the expected negative relationship between bed net use and malaria incidence could not be tested in this notebook.
- The correlation between rural population percentage and malaria incidence was not computed in this notebook; that rural areas may be more susceptible to malaria remains a hypothesis for further analysis.


**Conclusion:**

The EDA provided valuable insights into the structure and characteristics of the dataset. The findings from the EDA will be used to inform the next steps in the data analysis process, such as feature selection, model building, and interpretation.