<h1 style="text-align: center;"><a title="Data Science-AIMS-Cmr-2021-22">Chapter 3: 
    Introducing Features and Observations</a></h1>

**Instructor:** 

* Rockefeller

**Learning Objectives:**

* Understand the importance of structured data and the key principles of tidy data.

* Differentiate between variables and observations in a dataset.

* Learn to identify when and why to reshape datasets.

* Master the usage of the melt function in pandas to transform data from a wide format to a long format.

* Analyze real-world data to detect and rectify structural anomalies.

* Gain hands-on experience in preparing data for further statistical analysis or visualization by ensuring it adheres to the tidy data principles.

# Introduction:

It is often said that 80% of data analysis is spent on **cleaning** and **preparing** data. It is not just a first step: it must be repeated many times over the course of the analysis as new problems come to light or new data is collected.

To get a handle on the problem, this part focuses on a small, but important, aspect of data cleaning that we call data **tidying: structuring datasets to facilitate analysis.** 
It also formally introduces the concept of **features** and **observations**.

In [1]:
import pandas as pd

In [None]:
%load john_anna

># <font color=#800080>Task 6:</font> <a class="anchor" id="Task-1"></a>


As the world needs more sustainable and efficient ways to grow food, people are starting to see how helpful artificial intelligence (AI) can be for farming. Because of this, the Zambia Farmers' Federation has partnered with the University of Lusaka's Department of Agriculture. They want to find new ways to help Zambia grow more food.

You have been chosen to be the **Lead data analyst** for this project because you are very good at it. The first goal of the project is to test two new fertilizers to see if they can help crops grow more. Your job is to look carefully at the data, use your analytical skills, and find meaningful insights that will help the project team decide what to do next.

You've just received a detailed report from the leading Agri-expert on the team. Here's the content of their message:


---
>### <font color=#800080> </font> <a class="anchor" id="Task-1"></a>=====================================

*Greetings!*

*In agricultural research, we often call using fertilizers on crops a "treatment". I have tested two different fertilizers on three crops: mangoes, avocados, and pineapples. The first fertilizer, Axida (Treatment A), is mostly made of organic compounds that are high in nitrogen. The second fertilizer, Bross (Treatment B), is mostly made of minerals that are high in potassium and has added micro-nutrients. One of the interesting things we measure is how much gas the crops emit after the fertilizer is applied. This can tell us how the plants are responding to the fertilizers.*


*Here are the specifics:*

- **For Axida (Treatment A)**:
  - Mango: **4.5** units of gas emission
  - Avocado: **2.1** units of gas emission
  - Pineapple: **1.9** units of gas emission

- **For Bross (Treatment B)**:
  - Mango: **5.1** units of gas emission
  - Avocado: **1.3** units of gas emission
  - Pineapple: **5.3** units of gas emission

*I eagerly await your expert analysis on this data. Let's make a significant impact together!*

>### <font color=#800080> </font> <a class="anchor" id="Task-1"></a>=====================================
---

1. Plants have always had special ways of interacting with their surroundings and with each other. Can you think of ways that plants might "talk" to each other? What scientific reasons could there be for these things to happen?


2.  Translate the information in the email that the agricultural expert sent to you into a form that can be used for analysis.


3. Two other analysts, Annah and Jonas, have translated that email into the sheets below. Run the Python code below (`%load john_annah.py`) and tell us what you observe.

In [4]:
# %load john_annah.py
annah_df =  pd.read_csv('data/Annah_data.csv')
jonas_df = pd.read_csv('data/Jonas_data.csv')


In [8]:
annah_df

Unnamed: 0,Fruits,Treatment A,Treatment B
0,Mango,4.5,2.1
1,Avocado,2.1,1.3
2,Pineapple,1.9,5.3


In [10]:
annah_df_tidy = pd.melt(annah_df, id_vars=['Fruits'], var_name='Treatment', value_name='Gas_Unit')
annah_df_tidy

Unnamed: 0,Fruits,Treatment,Gas_Unit
0,Mango,Treatment A,4.5
1,Avocado,Treatment A,2.1
2,Pineapple,Treatment A,1.9
3,Mango,Treatment B,2.1
4,Avocado,Treatment B,1.3
5,Pineapple,Treatment B,5.3


Note that the wide layout, with one column per treatment, might be good for presentation, but it is not tidy for analysis.


## Standardizing the concepts of variables and observations

The idea is to give a standard way to organize the values within a dataset. Formalizing what rows and columns mean lets the analyst spend more time on **the interesting domain problem** and less on **the uninteresting logistics of the data**. A tidy dataset follows three rules:


 1. Each variable forms a column.

 2. Each observation forms a row.

 3. Each type of observational unit forms a table.

Formally, 

- **A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units.**
- **An observation contains all values measured on the same unit (like a person, or a day, or a city) across attributes.**

Some common problems in messy datasets:

- Column headers are values, not variable names.
- Multiple variables are stored in one column.
- Variables are stored in both rows and columns.
- Multiple types of observational units are stored in the same table.
- A single observational unit is stored in multiple tables.

**Use the principles described above to reorganize the dataset above.**

**What insight can you extract from the data?**

In [3]:
import pandas as pd
# %load john_annah.py
annah_df =  pd.read_csv('data/Annah_data.csv')
annah_df

Unnamed: 0,Fruits,Treatment A,Treatment B
0,Mango,4.5,2.1
1,Avocado,2.1,1.3
2,Pineapple,1.9,5.3


In [6]:
tidy_annah_df = pd.melt(annah_df, id_vars=['Fruits'], var_name='Treatment', value_name='Gas_Unit')
tidy_annah_df

Unnamed: 0,Fruits,Treatment,Gas_Unit
0,Mango,Treatment A,4.5
1,Avocado,Treatment A,2.1
2,Pineapple,Treatment A,1.9
3,Mango,Treatment B,2.1
4,Avocado,Treatment B,1.3
5,Pineapple,Treatment B,5.3


In [4]:
jonas_df = pd.read_csv('data/Jonas_data.csv')
jonas_df

Unnamed: 0,Treatment,Mango,Avocado,Pineapple
0,A,4.5,2.1,1.9
1,B,2.1,1.3,5.3


In [7]:
tidy_jonas_df = pd.melt(jonas_df, id_vars=['Treatment'], var_name='Crops', value_name='Gas_Unit')
tidy_jonas_df

Unnamed: 0,Treatment,Crops,Gas_Unit
0,A,Mango,4.5
1,B,Mango,2.1
2,A,Avocado,2.1
3,B,Avocado,1.3
4,A,Pineapple,1.9
5,B,Pineapple,5.3
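Once the data is tidy, the question "what insight can you extract?" often reduces to a short `groupby`. The sketch below rebuilds the tidy table by hand from the values recorded in the sheets above (it does not read the CSV files), so the names `tidy_df`, `mean_by_treatment` and `top_treatment` are illustrative only.

```python
import pandas as pd

# Tidy fertilizer data, copied from the tidy_annah_df output above.
tidy_df = pd.DataFrame({
    "Fruits": ["Mango", "Avocado", "Pineapple"] * 2,
    "Treatment": ["Treatment A"] * 3 + ["Treatment B"] * 3,
    "Gas_Unit": [4.5, 2.1, 1.9, 2.1, 1.3, 5.3],
})

# Average emission per treatment: one value per treatment.
mean_by_treatment = tidy_df.groupby("Treatment")["Gas_Unit"].mean()

# For each fruit, the row holding its highest emission.
top_rows = tidy_df.loc[tidy_df.groupby("Fruits")["Gas_Unit"].idxmax()]
top_treatment = top_rows.set_index("Fruits")["Treatment"]

print(mean_by_treatment)
print(top_treatment)
```

The same two lines of analysis would need column-by-column handling on either wide sheet, which is the practical argument for tidying first.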


The layout of the data above could be repaired by hand, but pandas provides a function called `melt` that does it in one call.

It uses three main parameters: `id_vars`, `var_name` and `value_name`.

* `id_vars`: the column(s) to keep as identifier variables; they are repeated on every new row.
* `var_name`: the name of the new column that receives the old column headers (the values that ran across the header from left to right).
* `value_name`: the name of the new column that receives the cell values from the melted columns.
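The parameters are easier to see on a very small table. This is a minimal sketch on a hypothetical `wide_df` (the store and quarter names are invented for illustration); it also shows `value_vars`, an optional `melt` parameter that restricts which columns are unpivoted.

```python
import pandas as pd

# Hypothetical wide table: one column per quarter.
wide_df = pd.DataFrame({
    "store": ["North", "South"],
    "Q1": [10, 7],
    "Q2": [12, 9],
})

long_df = pd.melt(
    wide_df,
    id_vars=["store"],        # kept as-is and repeated on every new row
    value_vars=["Q1", "Q2"],  # columns to unpivot (default: every non-id column)
    var_name="quarter",       # new column that receives the old headers
    value_name="sales",       # new column that receives the cell values
)
print(long_df)
```

The result has one row per (store, quarter) pair: two stores times two quarters gives four rows.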

># <font color=#800080>Task 6:</font> <a class="anchor" id="Task-1"></a>

With 60 million active users, **Boomplay** is the most popular music streaming service in Africa. The Chinese-owned, Africa-focused company is available throughout the continent and runs a freemium model. They are planning to open new offices in the County of Zwedru in Liberia. You were lucky enough to secure a fully funded internship with them. On your first day in the office, the Regional Manager stated that they are working on remixing the classics from the `AfroParade` and distributing them on their platform. The `AfroParade` charts tabulate the relative weekly popularity of songs and albums across Africa. For the first phase, they chose the classics from the beginning of the millennium: the big year 2000. The data was scraped from the **AfroParade** database and given to you in a CSV file called `best_afro_songs_2000s.csv`.


1. How do you think music streaming platforms make money if you can listen to music there for free? And how do artists benefit from it?

2. Load the data into pandas and tell us what you observe. If there is any anomaly, fix it.

In [11]:
import pandas as pd
best_afro_df = pd.read_csv('/home/students-asn17/Data_course/Week_02/Day_02/data/afro_songs_2000s.csv')
best_afro_df

Unnamed: 0,year,artist,track,time,date.entered,wk1,wk2,wk3,wk4,wk5,...,wk67,wk68,wk69,wk70,wk71,wk72,wk73,wk74,wk75,wk76
0,2009,Michael Ruiz (Nigeria),Afrobeat - Traditional,3:23,2009-08-07,34,44,53,50,56,...,100,95,95,97,100,100,100,100,100,99
1,2004,Zachary Shepard (South Africa),Benga - Must,4:43,2004-06-06,12,19,22,17,13,...,100,100,100,100,97,100,100,100,100,100
2,2016,Dwayne Mckay (Senegal),Amapiano - Plan,2:02,2016-06-24,13,15,17,12,19,...,97,96,94,95,92,100,100,100,100,98
3,2008,Christopher Rivera (South Africa),Benga - Trade,3:00,2008-01-24,70,70,71,80,82,...,86,95,100,98,100,100,100,100,100,100
4,2010,Amanda Gonzalez (Nigeria),Highlife - The,3:49,2010-02-20,2,1,10,16,26,...,98,98,95,100,98,97,95,100,100,100
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
312,2003,Andrew Jensen (South Africa),Benga - Offer,3:36,2003-11-17,58,65,62,61,70,...,100,97,96,97,100,100,100,100,100,95
313,2009,Michael White (South Africa),Amapiano - Happen,4:45,2009-08-20,100,100,100,100,100,...,98,100,100,95,96,100,100,100,100,100
314,2013,Robin Peterson (Senegal),Benga - Little,3:54,2013-11-27,27,27,30,39,39,...,100,100,100,97,94,95,100,100,100,98
315,2021,Debra Morgan (Senegal),Amapiano - Those,3:42,2021-04-14,26,29,30,40,42,...,100,100,100,100,100,96,100,100,95,97
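In this version of the file, the `artist` and `track` columns each hold two variables (name and country, genre and title), which is the "multiple variables are stored in one column" problem. The sketch below shows one possible way to separate them, assuming every row follows the `Name (Country)` and `Genre - Title` patterns seen above; the frame `songs_df` is a two-row stand-in, not the real file.

```python
import pandas as pd

# Two rows copied from the output above.
songs_df = pd.DataFrame({
    "artist": ["Michael Ruiz (Nigeria)", "Zachary Shepard (South Africa)"],
    "track": ["Afrobeat - Traditional", "Benga - Must"],
})

# "Name (Country)": two named capture groups become two columns.
artist_parts = songs_df["artist"].str.extract(r"^(?P<Artist>.+?)\s*\((?P<Country>.+)\)$")

# "Genre - Title": split on the first " - " only, in case a title contains one.
track_parts = songs_df["track"].str.split(" - ", n=1, expand=True)
track_parts.columns = ["Genre", "Title"]

clean_df = pd.concat([artist_parts, track_parts], axis=1)
print(clean_df)
```

Rows that do not match the pattern come back as missing values from `str.extract`, so it is worth checking `clean_df.isna().sum()` before trusting the split on the full dataset.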


In [21]:
#import pandas as pd
#
best_afro_df = pd.read_csv('data/best_afro_songs_2000s.csv')
best_afro_df

Unnamed: 0,year,Artist,Country,Genre,Title,time,date.entered,wk2,wk3,wk4,...,wk67,wk68,wk69,wk70,wk71,wk72,wk73,wk74,wk75,wk76
0,2009,Michael Ruiz,Nigeria,Afrobeat,Traditional,3:23,2009-08-07,44,53,50,...,100,95,95,97,100,100,100,100,100,99
1,2004,Zachary Shepard,South Africa,Benga,Must,4:43,2004-06-06,19,22,17,...,100,100,100,100,97,100,100,100,100,100
2,2016,Dwayne Mckay,Senegal,Amapiano,Plan,2:02,2016-06-24,15,17,12,...,97,96,94,95,92,100,100,100,100,98
3,2008,Christopher Rivera,South Africa,Benga,Trade,3:00,2008-01-24,70,71,80,...,86,95,100,98,100,100,100,100,100,100
4,2010,Amanda Gonzalez,Nigeria,Highlife,The,3:49,2010-02-20,1,10,16,...,98,98,95,100,98,97,95,100,100,100
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
312,2003,Andrew Jensen,South Africa,Benga,Offer,3:36,2003-11-17,65,62,61,...,100,97,96,97,100,100,100,100,100,95
313,2009,Michael White,South Africa,Amapiano,Happen,4:45,2009-08-20,100,100,100,...,98,100,100,95,96,100,100,100,100,100
314,2013,Robin Peterson,Senegal,Benga,Little,3:54,2013-11-27,27,30,39,...,100,100,100,97,94,95,100,100,100,98
315,2021,Debra Morgan,Senegal,Amapiano,Those,3:42,2021-04-14,29,30,40,...,100,100,100,100,100,96,100,100,95,97


In [22]:
best_afro_df.head()

Unnamed: 0,year,Artist,Country,Genre,Title,time,date.entered,wk2,wk3,wk4,...,wk67,wk68,wk69,wk70,wk71,wk72,wk73,wk74,wk75,wk76
0,2009,Michael Ruiz,Nigeria,Afrobeat,Traditional,3:23,2009-08-07,44,53,50,...,100,95,95,97,100,100,100,100,100,99
1,2004,Zachary Shepard,South Africa,Benga,Must,4:43,2004-06-06,19,22,17,...,100,100,100,100,97,100,100,100,100,100
2,2016,Dwayne Mckay,Senegal,Amapiano,Plan,2:02,2016-06-24,15,17,12,...,97,96,94,95,92,100,100,100,100,98
3,2008,Christopher Rivera,South Africa,Benga,Trade,3:00,2008-01-24,70,71,80,...,86,95,100,98,100,100,100,100,100,100
4,2010,Amanda Gonzalez,Nigeria,Highlife,The,3:49,2010-02-20,1,10,16,...,98,98,95,100,98,97,95,100,100,100


In [23]:
best_afro_df.info()  # info is a method: without the parentheses, pandas only displays the bound method

In [24]:
best_afro_df.tail()

Unnamed: 0,year,Artist,Country,Genre,Title,time,date.entered,wk2,wk3,wk4,...,wk67,wk68,wk69,wk70,wk71,wk72,wk73,wk74,wk75,wk76
312,2003,Andrew Jensen,South Africa,Benga,Offer,3:36,2003-11-17,65,62,61,...,100,97,96,97,100,100,100,100,100,95
313,2009,Michael White,South Africa,Amapiano,Happen,4:45,2009-08-20,100,100,100,...,98,100,100,95,96,100,100,100,100,100
314,2013,Robin Peterson,Senegal,Benga,Little,3:54,2013-11-27,27,30,39,...,100,100,100,97,94,95,100,100,100,98
315,2021,Debra Morgan,Senegal,Amapiano,Those,3:42,2021-04-14,29,30,40,...,100,100,100,100,100,96,100,100,95,97
316,2010,Ronald Reyes,Ethiopia,Highlife,Art,4:24,2010-12-08,74,79,77,...,100,100,100,100,100,100,97,100,100,96


In [25]:
best_afro_df.columns

Index(['year', 'Artist', 'Country', 'Genre', 'Title', 'time', 'date.entered',
       'wk2', 'wk3', 'wk4', 'wk5', 'wk6', 'wk7', 'wk8', 'wk9', 'wk10', 'wk11',
       'wk12', 'wk13', 'wk14', 'wk15', 'wk16', 'wk17', 'wk18', 'wk19', 'wk20',
       'wk21', 'wk22', 'wk23', 'wk24', 'wk25', 'wk26', 'wk27', 'wk28', 'wk29',
       'wk30', 'wk31', 'wk32', 'wk33', 'wk34', 'wk35', 'wk36', 'wk37', 'wk38',
       'wk39', 'wk40', 'wk41', 'wk42', 'wk43', 'wk44', 'wk45', 'wk46', 'wk47',
       'wk48', 'wk49', 'wk50', 'wk51', 'wk52', 'wk53', 'wk54', 'wk55', 'wk56',
       'wk57', 'wk58', 'wk59', 'wk60', 'wk61', 'wk62', 'wk63', 'wk64', 'wk65',
       'wk66', 'wk67', 'wk68', 'wk69', 'wk70', 'wk71', 'wk72', 'wk73', 'wk74',
       'wk75', 'wk76'],
      dtype='object')

In [27]:
best_afro_df_tidy = pd.melt(best_afro_df,
                            id_vars=['year', 'Artist', 'Country', 'Genre', 'Title', 'time', 'date.entered'],
                            var_name='Week_No', value_name='Ranking')

In [28]:
best_afro_df_tidy

Unnamed: 0,year,Artist,Country,Genre,Title,time,date.entered,Week_No,Ranking
0,2009,Michael Ruiz,Nigeria,Afrobeat,Traditional,3:23,2009-08-07,wk2,44
1,2004,Zachary Shepard,South Africa,Benga,Must,4:43,2004-06-06,wk2,19
2,2016,Dwayne Mckay,Senegal,Amapiano,Plan,2:02,2016-06-24,wk2,15
3,2008,Christopher Rivera,South Africa,Benga,Trade,3:00,2008-01-24,wk2,70
4,2010,Amanda Gonzalez,Nigeria,Highlife,The,3:49,2010-02-20,wk2,1
...,...,...,...,...,...,...,...,...,...
23770,2003,Andrew Jensen,South Africa,Benga,Offer,3:36,2003-11-17,wk76,95
23771,2009,Michael White,South Africa,Amapiano,Happen,4:45,2009-08-20,wk76,100
23772,2013,Robin Peterson,Senegal,Benga,Little,3:54,2013-11-27,wk76,98
23773,2021,Debra Morgan,Senegal,Amapiano,Those,3:42,2021-04-14,wk76,97
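After melting, `Week_No` still holds text labels such as `wk2` and `wk76`. If the week is to be used as a number (for sorting or plotting), one option is to extract the digits. This is a sketch on a three-row stand-in with values copied from the output above; the name `tidy_df` is illustrative.

```python
import pandas as pd

# A few rows shaped like best_afro_df_tidy (values copied from the output above).
tidy_df = pd.DataFrame({
    "Artist": ["Michael Ruiz", "Zachary Shepard", "Michael White"],
    "Week_No": ["wk2", "wk2", "wk76"],
    "Ranking": [44, 19, 100],
})

# Pull the digits out of labels such as "wk76" and store them as integers.
tidy_df["Week_No"] = tidy_df["Week_No"].str.extract(r"(\d+)", expand=False).astype(int)

# Numeric weeks sort correctly; as strings, "wk10" would come before "wk2".
tidy_df = tidy_df.sort_values(["Artist", "Week_No"]).reset_index(drop=True)
print(tidy_df)
```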


In [20]:
best_afro_df_tidy.head(5)

Unnamed: 0,year,artist,track,time,date.entered,Week_No,Ranking
0,2009,Michael Ruiz (Nigeria),Afrobeat - Traditional,3:23,2009-08-07,wk1,34
1,2004,Zachary Shepard (South Africa),Benga - Must,4:43,2004-06-06,wk1,12
2,2016,Dwayne Mckay (Senegal),Amapiano - Plan,2:02,2016-06-24,wk1,13
3,2008,Christopher Rivera (South Africa),Benga - Trade,3:00,2008-01-24,wk1,70
4,2010,Amanda Gonzalez (Nigeria),Highlife - The,3:49,2010-02-20,wk1,2


># <font color=#800080>Task 7:</font> <a class="anchor" id="Task-7"></a>

**Bindura** is a small town in the Mashonaland Central province of Zimbabwe, located north-east of Harare. At Howard Hospital, a small medical facility in Bindura, the number of people with tuberculosis (TB) increased by 35% in 2008, compared to the average number of people with TB from 2003 to 2007.

Under the **Makeba Funding initiative**, which encourages African medical institutions to share data, a team of research scientists from Hôpital Général de Befelatanana in Antananarivo has developed a new drug to treat patients with severe TB symptoms, such as fatigue, chest pain, fever, and cough. As a data analyst, you have been chosen to join the team traveling to Bindura to study the drug's side effects on patients.

At Howard Hospital, the drug has been given to 40 patients, both men and women, aged between 19 and 46. The team has monitored the patients' fatigue levels for 100 days and recorded the results in a CSV file. The data includes fatigue levels ranging from 0 to 10, where 0 means no signs of fatigue and 10 means extreme fatigue.

The data file, `bindura_tb_patients.csv`, contains the relevant information, and you are assigned to work with it.

1. Do you know how tuberculosis spreads from person to person?

2. Load the data file and list all the anomalies that you observe.

3. Use the melt function to fix the inconsistencies within the data

4. What insights can you extract from the data?

# EXERCISE: TIDY THE DATA

In [2]:
import pandas as pd
bindura_df = pd.read_csv('/home/students-asn17/Data_course/Week_02/Day_02/data/bindura_tb_patients.csv')
bindura_df

Unnamed: 0,days,male_1,male_2,male_3,male_4,male_5,male_6,male_7,male_8,male_9,...,female_11,female_12,female_13,female_14,female_15,female_16,female_17,female_18,female_19,female_20
0,1,4,0,9,5,5,0,9,9,9,...,7,7,7,7,2,4,3,1,2,9
1,2,2,6,7,1,7,2,1,0,0,...,3,3,6,9,6,4,9,8,2,0
2,3,5,0,5,5,9,2,5,2,8,...,8,6,6,3,5,3,3,4,1,6
3,4,9,5,7,3,4,4,3,7,4,...,8,3,5,0,2,7,1,7,1,6
4,5,3,2,2,3,7,5,4,9,5,...,1,7,3,6,0,3,8,9,5,9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,96,1,9,2,6,5,0,4,8,0,...,7,0,4,2,0,3,3,5,7,7
96,97,3,4,5,1,7,1,8,7,0,...,2,2,2,5,7,4,7,3,4,0
97,98,8,6,1,3,2,1,0,5,6,...,3,0,2,0,2,1,3,9,3,0
98,99,2,9,4,7,8,0,7,9,0,...,5,8,2,8,2,4,7,1,9,0


In [3]:
bindura_df.columns

Index(['days', 'male_1', 'male_2', 'male_3', 'male_4', 'male_5', 'male_6',
       'male_7', 'male_8', 'male_9', 'male_10', 'male_11', 'male_12',
       'male_13', 'male_14', 'male_15', 'male_16', 'male_17', 'male_18',
       'male_19', 'male_20', 'female_1', 'female_2', 'female_3', 'female_4',
       'female_5', 'female_6', 'female_7', 'female_8', 'female_9', 'female_10',
       'female_11', 'female_12', 'female_13', 'female_14', 'female_15',
       'female_16', 'female_17', 'female_18', 'female_19', 'female_20'],
      dtype='object')

In [35]:
all_col = bindura_df.columns.to_list()
all_col

['days',
 'male_1',
 'male_2',
 'male_3',
 'male_4',
 'male_5',
 'male_6',
 'male_7',
 'male_8',
 'male_9',
 'male_10',
 'male_11',
 'male_12',
 'male_13',
 'male_14',
 'male_15',
 'male_16',
 'male_17',
 'male_18',
 'male_19',
 'male_20',
 'female_1',
 'female_2',
 'female_3',
 'female_4',
 'female_5',
 'female_6',
 'female_7',
 'female_8',
 'female_9',
 'female_10',
 'female_11',
 'female_12',
 'female_13',
 'female_14',
 'female_15',
 'female_16',
 'female_17',
 'female_18',
 'female_19',
 'female_20']

In [36]:
all_col.index('male_20')

20

In [37]:
male_cols = all_col[0:21]
male_cols

['days',
 'male_1',
 'male_2',
 'male_3',
 'male_4',
 'male_5',
 'male_6',
 'male_7',
 'male_8',
 'male_9',
 'male_10',
 'male_11',
 'male_12',
 'male_13',
 'male_14',
 'male_15',
 'male_16',
 'male_17',
 'male_18',
 'male_19',
 'male_20']
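Slicing the column list by position works, but it depends on the columns staying in this exact order. A sketch of selecting by name prefix instead is shown below, on a small stand-in frame `demo_df` that follows the same naming scheme (the first two days for three of the patients, copied from this task's outputs).

```python
import pandas as pd

# Small stand-in for bindura_df with the same naming scheme.
demo_df = pd.DataFrame({
    "days": [1, 2],
    "male_1": [4, 2],
    "male_2": [0, 6],
    "female_1": [2, 6],
})

# startswith avoids the trap that "male_" is also a substring of "female_".
male_cols = ["days"] + [c for c in demo_df.columns if c.startswith("male_")]
female_cols = ["days"] + [c for c in demo_df.columns if c.startswith("female_")]

male_df = demo_df[male_cols]
female_df = demo_df[female_cols]
print(male_cols)
print(female_cols)
```

A substring test such as `"male_" in c` would select the female columns as well, so anchoring the match at the start of the name matters here.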

In [43]:
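# Caution: all_col lists every column, so male_df still holds the female columns;
# selecting with male_cols would keep only 'days' and male_1 to male_20.
male_df = bindura_df[all_col]
male_df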

Unnamed: 0,days,male_1,male_2,male_3,male_4,male_5,male_6,male_7,male_8,male_9,...,female_11,female_12,female_13,female_14,female_15,female_16,female_17,female_18,female_19,female_20
0,1,4,0,9,5,5,0,9,9,9,...,7,7,7,7,2,4,3,1,2,9
1,2,2,6,7,1,7,2,1,0,0,...,3,3,6,9,6,4,9,8,2,0
2,3,5,0,5,5,9,2,5,2,8,...,8,6,6,3,5,3,3,4,1,6
3,4,9,5,7,3,4,4,3,7,4,...,8,3,5,0,2,7,1,7,1,6
4,5,3,2,2,3,7,5,4,9,5,...,1,7,3,6,0,3,8,9,5,9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,96,1,9,2,6,5,0,4,8,0,...,7,0,4,2,0,3,3,5,7,7
96,97,3,4,5,1,7,1,8,7,0,...,2,2,2,5,7,4,7,3,4,0
97,98,8,6,1,3,2,1,0,5,6,...,3,0,2,0,2,1,3,9,3,0
98,99,2,9,4,7,8,0,7,9,0,...,5,8,2,8,2,4,7,1,9,0


In [44]:
male_df.head(5)

Unnamed: 0,days,male_1,male_2,male_3,male_4,male_5,male_6,male_7,male_8,male_9,...,female_11,female_12,female_13,female_14,female_15,female_16,female_17,female_18,female_19,female_20
0,1,4,0,9,5,5,0,9,9,9,...,7,7,7,7,2,4,3,1,2,9
1,2,2,6,7,1,7,2,1,0,0,...,3,3,6,9,6,4,9,8,2,0
2,3,5,0,5,5,9,2,5,2,8,...,8,6,6,3,5,3,3,4,1,6
3,4,9,5,7,3,4,4,3,7,4,...,8,3,5,0,2,7,1,7,1,6
4,5,3,2,2,3,7,5,4,9,5,...,1,7,3,6,0,3,8,9,5,9


In [52]:
tidy_male_df = pd.melt(male_df, id_vars=['days'], var_name='patients_ID', value_name='Faligue_level')
tidy_male_df

Unnamed: 0,days,patients_ID,Faligue_level
0,1,male_1,4
1,2,male_1,2
2,3,male_1,5
3,4,male_1,9
4,5,male_1,3
...,...,...,...
3995,96,female_20,7
3996,97,female_20,0
3997,98,female_20,0
3998,99,female_20,0


In [53]:
bindura_df.columns

Index(['days', 'male_1', 'male_2', 'male_3', 'male_4', 'male_5', 'male_6',
       'male_7', 'male_8', 'male_9', 'male_10', 'male_11', 'male_12',
       'male_13', 'male_14', 'male_15', 'male_16', 'male_17', 'male_18',
       'male_19', 'male_20', 'female_1', 'female_2', 'female_3', 'female_4',
       'female_5', 'female_6', 'female_7', 'female_8', 'female_9', 'female_10',
       'female_11', 'female_12', 'female_13', 'female_14', 'female_15',
       'female_16', 'female_17', 'female_18', 'female_19', 'female_20'],
      dtype='object')

In [54]:
jonnah_df_tidy = pd.melt(jonas_df,id_vars= ['Treatment'], var_name='Fruits', value_name='Gas_Quantity')

jonnah_df_tidy

Unnamed: 0,Treatment,Fruits,Gas_Quantity
0,A,Mango,4.5
1,B,Mango,2.1
2,A,Avocado,2.1
3,B,Avocado,1.3
4,A,Pineapple,1.9
5,B,Pineapple,5.3


In [55]:
female_cols = [all_col[0]] + all_col[21:]

female_cols

['days',
 'female_1',
 'female_2',
 'female_3',
 'female_4',
 'female_5',
 'female_6',
 'female_7',
 'female_8',
 'female_9',
 'female_10',
 'female_11',
 'female_12',
 'female_13',
 'female_14',
 'female_15',
 'female_16',
 'female_17',
 'female_18',
 'female_19',
 'female_20']

In [56]:
femal_df = bindura_df[female_cols]

In [57]:
tidy_femmel_df = pd.melt(femal_df, id_vars=['days'], var_name= 'patients_ID', value_name= 'fatigue_level')

tidy_femmel_df

Unnamed: 0,days,patients_ID,fatigue_level
0,1,female_1,2
1,2,female_1,6
2,3,female_1,5
3,4,female_1,5
4,5,female_1,5
...,...,...,...
1995,96,female_20,7
1996,97,female_20,0
1997,98,female_20,0
1998,99,female_20,0


In [58]:
tidy_femmel_df.head(5)

Unnamed: 0,days,patients_ID,fatigue_level
0,1,female_1,2
1,2,female_1,6
2,3,female_1,5
3,4,female_1,5
4,5,female_1,5


In [59]:
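# The value columns are spelled differently ('Faligue_level' vs 'fatigue_level'),
# so concat keeps both and fills the missing cells with NaN.
all_patients_df = pd.concat([tidy_male_df, tidy_femmel_df], axis = 0)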

all_patients_df

Unnamed: 0,days,patients_ID,Faligue_level,fatigue_level
0,1,male_1,4.0,
1,2,male_1,2.0,
2,3,male_1,5.0,
3,4,male_1,9.0,
4,5,male_1,3.0,
...,...,...,...,...
1995,96,female_20,,7.0
1996,97,female_20,,0.0
1997,98,female_20,,0.0
1998,99,female_20,,0.0
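The two half-empty value columns in the output above come from the differing column names in the two melted frames: `pd.concat` aligns columns by name. A minimal sketch of one way to avoid this, using tiny stand-in frames (`male_part` and `female_part` are illustrative names), is to rename before stacking.

```python
import pandas as pd

# Tiny stand-ins for the two melted frames; note the misspelled column.
male_part = pd.DataFrame({"days": [1, 2], "patients_ID": ["male_1"] * 2, "Faligue_level": [4, 2]})
female_part = pd.DataFrame({"days": [1, 2], "patients_ID": ["female_1"] * 2, "fatigue_level": [2, 6]})

# Align the column names first, then stack; ignore_index gives a fresh 0..n-1 index.
male_part = male_part.rename(columns={"Faligue_level": "fatigue_level"})
combined = pd.concat([male_part, female_part], axis=0, ignore_index=True)
print(combined)
```

With matching names there is a single `fatigue_level` column and no missing values, and `ignore_index=True` avoids the repeated index labels that plain stacking would leave behind.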


In [60]:
all_patients_df.sample(5)

Unnamed: 0,days,patients_ID,Faligue_level,fatigue_level
332,33,female_4,,4.0
3730,31,female_18,9.0,
1907,8,male_20,8.0,
2969,70,female_10,9.0,
1765,66,male_18,1.0,


In [62]:
'Nene_Africa'.split('_')

['Nene', 'Africa']

In [63]:
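# Text before the underscore is the gender (computed here but not stored).
all_patients_df['patients_ID'].str.split('_').str[0]
# Text after the underscore is the patient number within that gender.
all_patients_df['ID'] = all_patients_df['patients_ID'].str.split('_').str[1]
all_patients_df.sample(5)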

Unnamed: 0,days,patients_ID,Faligue_level,fatigue_level,ID
2716,17,female_8,6.0,,8
3227,28,female_13,8.0,,13
1521,22,female_16,,8.0,16
475,76,female_5,,0.0,5
3087,88,female_11,9.0,,11


In [None]:
all_patients_df['gender'] = all_patients_df['patients_ID'].str.split('_').str[0]
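Both parts of `patients_ID` can also be obtained from a single split with `expand=True`. The sketch below uses a four-row stand-in, `patients_df`, shaped like the tidy patient table; it is an illustration under that assumption, not the full dataset.

```python
import pandas as pd

# Stand-in rows shaped like the tidy patient table.
patients_df = pd.DataFrame({
    "days": [1, 1, 2, 2],
    "patients_ID": ["male_1", "female_1", "male_1", "female_1"],
    "fatigue_level": [4, 2, 2, 6],
})

# One split yields both parts; expand=True returns them as separate columns.
parts = patients_df["patients_ID"].str.split("_", expand=True)
patients_df["gender"] = parts[0]
patients_df["ID"] = parts[1].astype(int)  # numeric, so 10 sorts after 9

# With gender as its own variable, group summaries are direct.
mean_by_gender = patients_df.groupby("gender")["fatigue_level"].mean()
print(mean_by_gender)
```

Note that `ID` alone does not identify a patient, because the numbering restarts for each gender; `gender` and `ID` together do.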

># <font color=#800080>Task 8:</font> <a class="anchor" id="Task-1"></a>


The Covid-19 pandemic has caused a lot of deaths all over the world. As part of the Russia-East Africa Partnership (REAP), the Russian Ministry of Health has made an agreement with government agencies in East Africa to start vaccination campaigns. The Sekou Toure Foundation has been asked to do a big survey in East Africa to collect data on how many people have Covid-19 (active cases) and how many people have died from it (fatalities).

The foundation's staff took strict protective measures, so the survey was only done from `October 2021` to `January 2022`. The data file has now been sent to the Data Science Department of Université polytechnique de Kougouleu in Libreville. They have contacted you because they need your help to understand the data. The data file is called `covid_19_eastafr.csv`.

1. Do you know who Sekou Toure was, and what did he do for the continent?
2. Load the data file and tell us what you observe.
3. Use the melt function to fix the inconsistencies within the data
4. What insights can you extract from the data?

In [178]:
east_africa_countries = ['Burundi', 'Comoros', 'Djibouti', 'Eritrea', 
                         'Ethiopia', 'Kenya',  
                         'Rwanda', 'Seychelles', 'Somalia', 'South Sudan', 
                         'Tanzania', 'Uganda',  'North Sudan']


In [64]:
covid_AF_df = pd.read_csv('/home/students-asn17/Data_course/Week_02/Day_02/data/covid_eastafr.csv')

In [66]:
covid_AF_df.head(5)

Unnamed: 0,Day,Cases_Burundi,Cases_Comoros,Cases_Djibouti,Cases_Eritrea,Cases_Ethiopia,Cases_Kenya,Cases_Rwanda,Cases_Seychelles,Cases_Somalia,...,Deaths_Eritrea,Deaths_Ethiopia,Deaths_Kenya,Deaths_Rwanda,Deaths_Seychelles,Deaths_Somalia,Deaths_South Sudan,Deaths_Tanzania,Deaths_Uganda,Deaths_North Sudan
0,2021-10-01,326,363,256,447,212,297,413,311,233,...,185,187,262,301,400,176,292,426,442,334
1,2021-10-05,294,449,201,320,169,382,412,376,315,...,197,349,432,423,353,175,146,203,151,201
2,2021-10-09,433,275,344,212,295,332,340,193,427,...,358,450,285,391,225,400,331,345,191,255
3,2021-10-13,331,374,198,174,299,454,218,182,395,...,210,328,286,268,179,268,409,270,411,204
4,2021-10-17,281,241,176,299,380,218,249,454,185,...,193,266,154,277,183,184,204,371,230,223


In [None]:
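tidy_covid_AF_df = pd.melt(covid_AF_df, id_vars=['Day'], var_name='Status_Country', value_name='Count')  # headers become values; 'Count' is an assumed name
tidy_covid_AF_df[['Status', 'Country']] = tidy_covid_AF_df['Status_Country'].str.split('_', n=1, expand=True)  # 'Cases_Burundi' -> 'Cases', 'Burundi'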
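The headers here combine two things: a status (`Cases` or `Deaths`) and a country. One possible tidy layout melts the headers, splits them, and then gives `Cases` and `Deaths` their own columns, since they are different variables measured on the same (day, country) unit. The sketch below runs on a small stand-in `wide_df` with values copied from the `head()` output above; the column names `Count`, `Status` and `Country` are assumptions made for illustration.

```python
import pandas as pd

# Small stand-in for covid_AF_df (values copied from the head() output above).
wide_df = pd.DataFrame({
    "Day": ["2021-10-01", "2021-10-05"],
    "Cases_Kenya": [297, 382],
    "Cases_Rwanda": [413, 412],
    "Deaths_Kenya": [262, 432],
    "Deaths_Rwanda": [301, 423],
})

# Step 1: headers become values in a single 'Status_Country' column.
long_df = pd.melt(wide_df, id_vars=["Day"], var_name="Status_Country", value_name="Count")

# Step 2: split on the first underscore only, so the rest of the
# header (e.g. 'South Sudan') stays intact as the country name.
long_df[["Status", "Country"]] = long_df["Status_Country"].str.split("_", n=1, expand=True)

# Step 3: Cases and Deaths are separate variables, so give each its own column.
tidy_df = (
    long_df.pivot(index=["Day", "Country"], columns="Status", values="Count")
    .reset_index()
    .rename_axis(columns=None)
)
print(tidy_df)
```

Each row of `tidy_df` is now one observation (a country on a day) and each column one variable, which matches the three rules stated earlier in the chapter.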