## Analyzing Malaria

##### Imports

In [18]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


%matplotlib inline

# Malaria

**Enter Description of Malaria**

In [19]:
# Let's read in the data first
df = pd.read_csv('Malaria.csv')
df

Unnamed: 0,Country,Year,No. of cases,No. of deaths,No. of cases_median,No. of cases_min,No. of cases_max,No. of deaths_median,No. of deaths_min,No. of deaths_max,WHO Region
0,Afghanistan,2017,630308[495000-801000],298[110-510],630308,495000.0,801000.0,298,110.0,510.0,Eastern Mediterranean
1,Algeria,2017,0,0,0,,,0,,,Africa
2,Angola,2017,4615605[3106000-6661000],13316[9970-16600],4615605,3106000.0,6661000.0,13316,9970.0,16600.0,Africa
3,Argentina,2017,0,0,0,,,0,,,Americas
4,Armenia,2017,0,0,0,,,0,,,Europe
...,...,...,...,...,...,...,...,...,...,...,...
851,Venezuela (Bolivarian Republic of),2010,57257[47000-74000],52[9-90],57257,47000.0,74000.0,52,9.0,90.0,Americas
852,Viet Nam,2010,23062[21000-26000],45[2-80],23062,21000.0,26000.0,45,2.0,80.0,Western Pacific
853,Yemen,2010,1134927[611000-2686000],2874[90-8490],1134927,611000.0,2686000.0,2874,90.0,8490.0,Eastern Mediterranean
854,Zambia,2010,2169307[1449000-3095000],6544[5580-7510],2169307,1449000.0,3095000.0,6544,5580.0,7510.0,Africa


**Let's first clean up the column names, as they seem really messy**

In [20]:
new_columns = ['Country','Year','Number of cases','Number of deaths','Number of cases - median',
               'Number of cases - min','Number of cases - max','Number of deaths - median','Number of deaths - min', 
               'Number of deaths - max', 'WHO Region']

df.columns = new_columns
df.head(1)

Unnamed: 0,Country,Year,Number of cases,Number of deaths,Number of cases - median,Number of cases - min,Number of cases - max,Number of deaths - median,Number of deaths - min,Number of deaths - max,WHO Region
0,Afghanistan,2017,630308[495000-801000],298[110-510],630308,495000.0,801000.0,298,110.0,510.0,Eastern Mediterranean


There are a total of **856 datapoints** with 11 attributes as follows:

- `Country` - The Country Name                      
- `Year` - The year the malaria statistic belongs to
- `Number of cases` - The total number of malaria cases             
- `Number of deaths` - The total number of deaths due to malaria             
- `Number of cases - median` - The median number of malaria cases       
- `Number of cases - min` - The minimum number of malaria cases         
- `Number of cases - max` - The maximum number of malaria cases      
- `Number of deaths - median` - The median number of malaria deaths     
- `Number of deaths - min` - The minimum number of malaria deaths        
- `Number of deaths - max` - The maximum number of malaria deaths
- `WHO Region` - The WHO region the country belongs to

In [21]:
# Let's take a look at the basic information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 856 entries, 0 to 855
Data columns (total 11 columns):
Country                      856 non-null object
Year                         856 non-null int64
Number of cases              856 non-null object
Number of deaths             856 non-null object
Number of cases - median     856 non-null int64
Number of cases - min        544 non-null float64
Number of cases - max        544 non-null float64
Number of deaths - median    856 non-null int64
Number of deaths - min       524 non-null float64
Number of deaths - max       524 non-null float64
WHO Region                   856 non-null object
dtypes: float64(4), int64(3), object(4)
memory usage: 73.7+ KB


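It seems like some of the datatypes were not read in correctly: `Number of cases` and `Number of deaths` came in as `object` because each cell holds text such as `630308[495000-801000]` rather than a plain number. Since we're going to have to change a few, let's just give pandas a list of datatypes to replace these.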
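
As a minimal sketch of what handing pandas a set of datatypes could look like, the snippet below passes a column-to-dtype mapping to `astype`. The `sample` frame and the `new_dtypes` mapping are hypothetical, built from two rows of the table above, and the target dtypes are only one reasonable choice, not something the dataset prescribes.

```python
import pandas as pd

# Toy rows copied from the table above (not the full dataset)
sample = pd.DataFrame({
    'Country': ['Afghanistan', 'Algeria'],
    'Year': [2017, 2017],
    'Number of cases - min': [495000.0, None],
    'WHO Region': ['Eastern Mediterranean', 'Africa'],
})

# Hypothetical mapping of column name -> target dtype
new_dtypes = {
    'Country': 'string',
    'WHO Region': 'category',
    'Number of cases - min': 'Int64',  # nullable integer keeps the missing value
}

sample = sample.astype(new_dtypes)
print(sample.dtypes)
```

The text columns that mix a number with a bracketed range cannot be converted this way until the number has been separated from the range.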

**Duplication check**

In [22]:
# Let's check if there are duplicates
df.duplicated().sum()

0
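
`duplicated()` with no arguments only flags rows that match in every column. A stricter check, sketched below on a hypothetical `sample` frame with made-up values, would look for repeated `Country` and `Year` pairs, assuming each country should appear at most once per year.

```python
import pandas as pd

# Hypothetical rows: Algeria 2017 appears twice with different figures
sample = pd.DataFrame({
    'Country': ['Algeria', 'Algeria', 'Angola'],
    'Year': [2017, 2017, 2017],
    'Number of cases - median': [0, 1, 4615605],
})

# Rows identical in every column
full_row_duplicates = sample.duplicated().sum()

# Rows that repeat an earlier (Country, Year) pair
key_duplicates = sample.duplicated(subset=['Country', 'Year']).sum()

print(full_row_duplicates, key_duplicates)
```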

**Null and missing value check**

In [23]:
# Checking for missing values
df.isna().sum()

Country                        0
Year                           0
Number of cases                0
Number of deaths               0
Number of cases - median       0
Number of cases - min        312
Number of cases - max        312
Number of deaths - median      0
Number of deaths - min       332
Number of deaths - max       332
WHO Region                     0
dtype: int64

**It seems like we've got the following missing values:**
- `Number of cases - min` - missing 312 values
- `Number of cases - max` - missing 312 values
- `Number of deaths - min` - missing 332 values
- `Number of deaths - max` - missing 332 values

**However, while going through the dataset, we noticed that some of these values can be imputed from other columns: in the rows shown below, the minimum and maximum are missing where `Number of cases` and `Number of deaths` hold a single figure with no bracketed range**

In [24]:
df.head(10)

Unnamed: 0,Country,Year,Number of cases,Number of deaths,Number of cases - median,Number of cases - min,Number of cases - max,Number of deaths - median,Number of deaths - min,Number of deaths - max,WHO Region
0,Afghanistan,2017,630308[495000-801000],298[110-510],630308,495000.0,801000.0,298,110.0,510.0,Eastern Mediterranean
1,Algeria,2017,0,0,0,,,0,,,Africa
2,Angola,2017,4615605[3106000-6661000],13316[9970-16600],4615605,3106000.0,6661000.0,13316,9970.0,16600.0,Africa
3,Argentina,2017,0,0,0,,,0,,,Americas
4,Armenia,2017,0,0,0,,,0,,,Europe
5,Azerbaijan,2017,0,0,0,,,0,,,Europe
6,Bangladesh,2017,32924[30000-36000],76[3-130],32924,30000.0,36000.0,76,3.0,130.0,South-East Asia
7,Belize,2017,7,0,7,,,0,,,Americas
8,Benin,2017,4111699[2774000-6552000],7328[5740-8920],4111699,2774000.0,6552000.0,7328,5740.0,8920.0,Africa
9,Bhutan,2017,11,0,11,,,0,,,South-East Asia
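
One possible way to impute the gaps seen above is sketched here: where no range was reported, the median figure is reused as both the minimum and the maximum. This is an assumption about how the gaps should be filled, not something the data source states, and the `sample` frame is a hypothetical subset of the rows above.

```python
import pandas as pd

# Toy rows copied from the table above
sample = pd.DataFrame({
    'Country': ['Afghanistan', 'Belize', 'Bhutan'],
    'Number of cases - median': [630308, 7, 11],
    'Number of cases - min': [495000.0, None, None],
    'Number of cases - max': [801000.0, None, None],
})

# Assumption: with no reported range, the single figure stands in for both bounds
for bound in ['Number of cases - min', 'Number of cases - max']:
    sample[bound] = sample[bound].fillna(sample['Number of cases - median'])

print(sample)
```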


In [26]:
# Creating four empty lists to collect median, minimum, maximum cases, and a min_max list to get the range
number_of_cases_median = []
number_of_cases_minimum = []
number_of_cases_maximum = []
min_max = []

# Creating a loop which goes through each cell of the column once
for cell in df['Number of cases']:
    # '630308[495000-801000]' -> ['630308', '495000-801000'], while '0' -> ['0']
    parts = cell.rstrip(']').split('[')
    # The part before '[' is the case count itself
    number_of_cases_median.append(parts[0])
    # The part after '[' is the 'min-max' range; cells without brackets only hold a single value
    min_max.append(parts[1] if len(parts) == 2 else parts[0])

In [8]:

# This loop goes through the min_max list and splits the values on '-' and puts the first in minimum and second in max
for number in min_max:
    bounds = number.split('-')
    if len(bounds) == 2:
        number_of_cases_minimum.append(bounds[0])
        number_of_cases_maximum.append(bounds[1])
    else:
        # No range in this cell: store NaN so the three lists stay the same length
        number_of_cases_minimum.append(np.nan)
        number_of_cases_maximum.append(np.nan)


# Converting the lists to series
number_of_cases_median = pd.Series(number_of_cases_median)
number_of_cases_minimum = pd.Series(number_of_cases_minimum)
number_of_cases_maximum = pd.Series(number_of_cases_maximum)


# Putting everything in a dataframe, one column per series
cases_columns = ['Number of cases - Median','Number of cases - Minimum','Number of cases - Maximum']
data = [number_of_cases_median, number_of_cases_minimum, number_of_cases_maximum]

malaria_cases = pd.DataFrame(dict(zip(cases_columns, data)))
malaria_cases.head()


In [None]:
%%time
# Creating four empty lists for number of deaths, the min_max range, number of deaths min, number of deaths max
number_of_deaths_median = []
min_max = []
number_of_deaths_minimum = []
number_of_deaths_maximum = []

# Create a loop which goes through each cell of the number of deaths once
for cell in df['Number of deaths']:
    # '298[110-510]' -> ['298', '110-510'], while '0' -> ['0']
    parts = cell.rstrip(']').split('[')
    number_of_deaths_median.append(parts[0])
    min_max.append(parts[1] if len(parts) == 2 else parts[0])

# Split each range on '-' and put the first value in minimum and the second in maximum
for number in min_max:
    bounds = number.split('-')
    if len(bounds) == 2:
        number_of_deaths_minimum.append(bounds[0])
        number_of_deaths_maximum.append(bounds[1])
    else:
        # No range in this cell: store NaN so the three lists stay the same length
        number_of_deaths_minimum.append(np.nan)
        number_of_deaths_maximum.append(np.nan)

# Let's now put everything in a dataframe, one column per list
deaths_columns = ['Number of deaths - Median','Number of deaths - Minimum','Number of deaths - Maximum']
data = [number_of_deaths_median, number_of_deaths_minimum, number_of_deaths_maximum]

malaria_deaths = pd.DataFrame(dict(zip(deaths_columns, data)))
malaria_deaths.head()
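
The same split can also be done without explicit loops. The sketch below uses `Series.str.extract` with a regular expression; the `pattern`, the group names, and the toy `cases` series are assumptions made for illustration, and the pattern only covers the two cell shapes visible in the table (`value` and `value[min-max]`).

```python
import pandas as pd

# Toy cells copied from the 'Number of cases' column above
cases = pd.Series(
    ['630308[495000-801000]', '0', '32924[30000-36000]', '7'],
    name='Number of cases',
)

# One required number, optionally followed by '[min-max]'
pattern = r'^(?P<median>\d+)(?:\[(?P<minimum>\d+)-(?P<maximum>\d+)\])?$'

# Each named group becomes a column; cells without a range give NaN bounds
parsed = cases.str.extract(pattern).astype('float')
print(parsed)
```

Because the extraction is vectorized, it touches each cell once, which keeps the runtime proportional to the number of rows.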

Now we're going to do a simple `value_counts()` and put each country name and its row count in a DataFrame

In [None]:
# Let's look at the distribution of value counts
# rename_axis/reset_index(name=...) set the column names explicitly, whatever value_counts() calls them
Country_DataFrame = df['Country'].value_counts().rename_axis('Country Name').reset_index(name = 'Value Counts')
# sort_values returns a new DataFrame, so the result has to be assigned back
Country_DataFrame = Country_DataFrame.sort_values(by = 'Value Counts', ascending = False)
Country_DataFrame

**When we looked at the data frame information, it seemed like there were some missing values. Let's take another look at them**

In [None]:
# Checking for missing values
df.isna().sum()

**Let's take a look at the Year column and see how the cases are spread over time**

In [None]:
df.head()

In [None]:
# `Year` is stored as an integer (int64), so it has to be converted to a timestamp before .dt can be used
# The dataset only records the year, so there is no month or day name to extract
df['Year of Case'] = pd.to_datetime(df['Year'], format = '%Y').dt.year


In [None]:
plt.figure()
plt.hist(df['Year'])
plt.show()

In [None]:
plt.figure()
# The columns were renamed earlier, so the new column name has to be used here
df['Number of deaths - max'].groupby(by = df['Country']).plot(kind = 'bar')

plt.show()

In [None]:
df['Country'].nunique()

In [None]:
plt.figure()
sns.countplot(x = df['Year'], color = 'green')
plt.xlabel('Years')
plt.ylabel('Malaria Count')
plt.show()

In [None]:
plt.figure(figsize = (10,10))
# The columns were renamed earlier, so the new column name has to be used here
df['Number of deaths - median'].groupby(by = df['Year']).plot()
plt.show()
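
Calling `.plot()` directly on a grouped column draws one line per group. If the goal is a single trend over time, one option is to aggregate first. The sketch below sums the median death figures per year; summing is an assumption about what should be shown, and the `sample` frame is a hypothetical subset built from four rows of the table at the top.

```python
import pandas as pd

# Toy rows copied from the first table
sample = pd.DataFrame({
    'Country': ['Afghanistan', 'Angola', 'Yemen', 'Zambia'],
    'Year': [2017, 2017, 2010, 2010],
    'Number of deaths - median': [298, 13316, 2874, 6544],
})

# One total per year, indexed by year in ascending order
yearly_deaths = sample.groupby('Year')['Number of deaths - median'].sum()
print(yearly_deaths)
```

Calling `yearly_deaths.plot()` on the result would then draw a single line with the year on the x-axis.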