<a href="https://colab.research.google.com/github/TsamayaDesigns/codeDivision-data-with-python/blob/main/Data_cleaning_with_normalisation_challenges.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Normalisation challenges

### Challenge 1 - prepare dataset for normalisation
---
1. Read Covid vaccination data from the `by_country` sheet in the Excel file at this link: https://github.com/lilaceri/Working-with-data-/blob/342abab10d93c4bf23b5c55a50f189f12a137c5f/Data%20Sets%20for%20code%20divisio/Covid%20Vaccination%20Data.xlsx?raw=true
2. Use `.info()` to find out which columns have missing values
3. Remove all rows with missing data in the `total_vaccinations` column
4. Find the median total vaccinations per hundred
5. Display the mean people vaccinated per hundred for each country, in descending order
6. Find the range (maximum minus minimum) of `total_vaccinations` across the dataframe


**Test output**:  
1. dataframe is saved in a variable
2.
```
RangeIndex: 14994 entries, 0 to 14993
Data columns (total 15 columns):
 #   Column                               Non-Null Count  Dtype         
---  ------                               --------------  -----         
 0   country                              14994 non-null  object        
 1   iso_code                             14994 non-null  object        
 2   date                                 14994 non-null  datetime64[ns]
 3   total_vaccinations                   9011 non-null   float64       
 4   people_vaccinated                    8370 non-null   float64       
 5   people_fully_vaccinated              6158 non-null   float64       
 6   daily_vaccinations_raw               7575 non-null   float64       
 7   daily_vaccinations                   14796 non-null  float64       
 8   total_vaccinations_per_hundred       9011 non-null   float64       
 9   people_vaccinated_per_hundred        8370 non-null   float64       
 10  people_fully_vaccinated_per_hundred  6158 non-null   float64       
 11  daily_vaccinations_per_million       14796 non-null  float64       
 12  vaccines                             14994 non-null  object        
 13  source_name                          14994 non-null  object        
 14  source_website                       14994 non-null  object        
dtypes: datetime64[ns](1), float64(9), object(5)
memory usage: 1.7+ MB
```
3. 9011 rows × 15 columns
4. 6.3
5.
```
country
Gibraltar                           64.975699
Bhutan                              55.961892
Falkland Islands                    51.063333
Saint Helena                        44.880000
Seychelles                          44.005686
                                      ...    
China                                     NaN
Ethiopia                                  NaN
Saint Vincent and the Grenadines          NaN
Samoa                                     NaN
Saudi Arabia                              NaN
Name: people_vaccinated_per_hundred, Length: 195, dtype: float64
```
6. 275338000.0


In [2]:
# Retrieve Data
import pandas as pd
pd.set_option('display.width', 240)

def get_excel_data():
  url = "https://github.com/lilaceri/Working-with-data-/blob/342abab10d93c4bf23b5c55a50f189f12a137c5f/Data%20Sets%20for%20code%20divisio/Covid%20Vaccination%20Data.xlsx?raw=true"
  df = pd.read_excel(url, sheet_name = "by_country")
  return df

# 1. Save dataframe to a variable ("data")
data = get_excel_data()

In [3]:
# Inspect Data
def inspect_data():
  print("\nChallenge 1 - Prepare dataset for normalisation\n")
  print("2. Use .info() to find columns with missing values\n")
  data.info()

  # *** EXTRA *** Below are extra steps to inspect the dataframe *** EXTRA ***
  # 2.2 Use .isna().any() to find out which columns have missing values
  # print("\n*** EXTRA *** Columns with missing values: \n", data.isna().any())

  # 2.3 Use .isna().sum() to find out number of missing values by column
  # print("\n*** EXTRA *** Missing values per column: \n", data.isna().sum())

inspect_data()


Challenge 1 - Prepare dataset for normalisation

2. Use .info() to find columns with missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14994 entries, 0 to 14993
Data columns (total 15 columns):
 #   Column                               Non-Null Count  Dtype         
---  ------                               --------------  -----         
 0   country                              14994 non-null  object        
 1   iso_code                             14994 non-null  object        
 2   date                                 14994 non-null  datetime64[ns]
 3   total_vaccinations                   9011 non-null   float64       
 4   people_vaccinated                    8370 non-null   float64       
 5   people_fully_vaccinated              6158 non-null   float64       
 6   daily_vaccinations_raw               7575 non-null   float64       
 7   daily_vaccinations                   14796 non-null  float64       
 8   total_vaccinations_per_hundred       9011 non-null   f

In [4]:
# Clean Data
def drop_missing_total_vaccinations(df):
  # 3. Remove all rows with missing data in the "total_vaccinations" column
  # (subset is passed as a list, which every pandas version accepts)
  df_tot_vac_dropna = df.dropna(subset=["total_vaccinations"])

  print("\n3. Remove all rows with missing data in the \"total_vaccinations\" column\n")
  df_tot_vac_dropna.info()

  # *** EXTRA *** Below are extra steps to inspect the dataframe *** EXTRA ***
  # print("\nColumns with missing values (note \"total_vaccinations\"): ")
  # print(df_tot_vac_dropna.isna().any())

  return df_tot_vac_dropna

# The function name differs from the result variable, so re-running this cell
# does not try to call a dataframe
clean_data = drop_missing_total_vaccinations(data)


3. Remove all rows with missing data in the "total_vaccinations" column

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9011 entries, 0 to 14993
Data columns (total 15 columns):
 #   Column                               Non-Null Count  Dtype         
---  ------                               --------------  -----         
 0   country                              9011 non-null   object        
 1   iso_code                             9011 non-null   object        
 2   date                                 9011 non-null   datetime64[ns]
 3   total_vaccinations                   9011 non-null   float64       
 4   people_vaccinated                    8294 non-null   float64       
 5   people_fully_vaccinated              6157 non-null   float64       
 6   daily_vaccinations_raw               7575 non-null   float64       
 7   daily_vaccinations                   8816 non-null   float64       
 8   total_vaccinations_per_hundred       9011 non-null   float64       
 9   people_vac

In [5]:
# Analyse Data
def analyse_data(clean_data):
  # 4. Find the median total vaccinations per hundred
  med_vac_per_100 = clean_data["total_vaccinations_per_hundred"].median()
  print("\n4. Find the median total vaccinations per hundred: \n", med_vac_per_100)

  # 5. Display the mean people vaccinated per hundred for each country in descending order
  mean_people_vac_per_100 = clean_data.groupby("country")["people_vaccinated_per_hundred"].mean().sort_values(ascending = False)
  print("\n5. Display the mean people vaccinated per hundred for each country in descending order: \n", mean_people_vac_per_100)

  # 6. Find the range of total_vaccinations across the dataframe
  range_tot_vacs = clean_data["total_vaccinations"].max() - clean_data["total_vaccinations"].min()
  print("\n6. Find the range of total_vaccinations across the dataframe: \n", range_tot_vacs)

analyse_data(clean_data)


4. Find the median total vaccinations per hundred: 
 6.3

5. Display the mean people vaccinated per hundred for each country in descending order: 
 country
Gibraltar                           64.975699
Bhutan                              55.961892
Falkland Islands                    51.063333
Saint Helena                        44.880000
Seychelles                          44.005686
                                      ...    
China                                     NaN
Ethiopia                                  NaN
Saint Vincent and the Grenadines          NaN
Samoa                                     NaN
Saudi Arabia                              NaN
Name: people_vaccinated_per_hundred, Length: 195, dtype: float64

6. Find the range of total_vaccinations across the dataframe: 
 275338000.0


### Challenge 2 - normalise daily vaccinations
---

1. Find the median daily vaccinations per million
2. Write a function to normalise daily vaccinations per million, so that values greater than or equal to the median become 1 and values less than the median become 0

**Test output**:

1. 1475.0
2. using describe()
```
count    14994.000000
mean         0.493464
std          0.499974
min          0.000000
25%          0.000000
50%          0.000000
75%          1.000000
max          1.000000
Name: daily_vaccinations_per_million, dtype: float64
```

In [20]:
# 1. Find the median daily vaccinations per million
med_dly_vacs_per_mil = data["daily_vaccinations_per_million"].median()
print("\n1. Find the median daily vaccinations per million: \n", med_dly_vacs_per_mil)



1. Find the median daily vaccinations per million: 
 1475.0
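
Step 2 is not yet implemented in the cell above. The following is a minimal sketch of one way to do it, not a definitive solution. It runs on a small made-up dataframe (`sample`) rather than the real `data`, and the function name `normalise_by_median` is an assumption introduced here for illustration.

```python
import pandas as pd

# Hypothetical sample standing in for data["daily_vaccinations_per_million"]
sample = pd.DataFrame(
    {"daily_vaccinations_per_million": [100.0, 1475.0, 3000.0, None, 50.0]}
)

def normalise_by_median(series):
    """Return 1 where a value is >= the series median, otherwise 0.

    A missing value compares as False, so it is coded as 0.
    """
    median = series.median()  # missing values are skipped by default
    return (series >= median).astype(int)

sample["daily_vaccinations_per_million"] = normalise_by_median(
    sample["daily_vaccinations_per_million"]
)
print(sample["daily_vaccinations_per_million"].describe())
```

The expected `count` of 14994 in the test output suggests that rows with missing values are kept and coded as 0, which is how the comparison behaves here. On the real dataframe the same function could be applied to `data["daily_vaccinations_per_million"]`.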


### Challenge 3 - Normalising total vaccinations   
---
The United Kingdom has been praised for its fast vaccine rollout.
1. Find the minimum total vaccinations for the United Kingdom
2. Save this value in a variable rounded down to an integer
3. Write a function to normalise the `total_vaccinations` column so that all values less than the UK's minimum are coded as 0 and all values greater than or equal to the UK's minimum are coded as 1
4. Display the countries whose total vaccinations are at or above the UK's minimum (those with at least one value coded as 1)

**Test output**:

1. 1402432.0
2. 1402432
3. `df['total_vaccinations']` should output:
```
0        0
6        0
22       0
44       0
59       0
        ..
14989    0
14990    0
14991    0
14992    0
14993    0
Name: total_vaccinations, Length: 9011, dtype: int64
```
4.
```
array(['Argentina', 'Australia', 'Austria', 'Azerbaijan', 'Bangladesh',
       'Belgium', 'Brazil', 'Cambodia', 'Canada', 'Chile', 'China',
       'Colombia', 'Czechia', 'Denmark', 'Dominican Republic', 'England',
       'Finland', 'France', 'Germany', 'Greece', 'Hong Kong', 'Hungary',
       'India', 'Indonesia', 'Ireland', 'Israel', 'Italy', 'Japan',
       'Kazakhstan', 'Malaysia', 'Mexico', 'Morocco', 'Nepal',
       'Netherlands', 'Norway', 'Pakistan', 'Peru', 'Philippines',
       'Poland', 'Portugal', 'Qatar', 'Romania', 'Russia', 'Saudi Arabia',
       'Scotland', 'Serbia', 'Singapore', 'Slovakia', 'South Korea',
       'Spain', 'Sweden', 'Switzerland', 'Thailand', 'Turkey',
       'United Arab Emirates', 'United Kingdom', 'United States',
       'Uruguay', 'Wales'], dtype=object)
```
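
The notebook has no code cell for this challenge, so here is a hedged sketch of the four steps. It uses a small made-up dataframe (`sample`) in place of the cleaned data, and the helper name `normalise_by_threshold` is hypothetical.

```python
import math
import pandas as pd

# Hypothetical sample standing in for the cleaned dataframe
sample = pd.DataFrame({
    "country": ["United Kingdom", "United Kingdom", "Chile", "India", "Malta"],
    "total_vaccinations": [1402432.0, 2000000.0, 420.0, 5000000.0, 1000.0],
})

# 1-2. Minimum total vaccinations for the UK, rounded down to an integer
uk_rows = sample[sample["country"] == "United Kingdom"]
uk_min = math.floor(uk_rows["total_vaccinations"].min())

def normalise_by_threshold(series, threshold):
    """Return 1 where a value is >= threshold, otherwise 0."""
    return (series >= threshold).astype(int)

# 3. Overwrite the column with its 0/1 coding
sample["total_vaccinations"] = normalise_by_threshold(
    sample["total_vaccinations"], uk_min
)

# 4. Countries with at least one row coded as 1
countries = sample.loc[sample["total_vaccinations"] == 1, "country"].unique()
print(uk_min)
print(countries)
```

Because step 3 overwrites `total_vaccinations`, the UK minimum has to be computed before normalising. On the real data, working on a copy of the cleaned dataframe would keep the original values available for later challenges.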




### Challenge 4 - create new series of total vaccinations for each manufacturer
---

To create a new column in your dataframe:

`df['new_column'] = ...`

For example:

* to duplicate an existing column
  * `df['new_column'] = df['old_column']`
* to add two columns together
  * `df['new_column'] = df['column1'] + df['column2']`
* to make a percentages column
  * `df['new_column'] = (df['column1']/df['column1'].sum()) * 100`
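
The three patterns above can be tried on a tiny dataframe. This is only an illustration: the column names and values are made up.

```python
import pandas as pd

# Hypothetical columns, used only to illustrate the three patterns above
df = pd.DataFrame({"column1": [25, 25, 50], "column2": [1, 2, 3]})

# Duplicate an existing column
df["copy"] = df["column1"]

# Add two columns together, row by row
df["total"] = df["column1"] + df["column2"]

# Express each value as a percentage of the column total
df["percent"] = (df["column1"] / df["column1"].sum()) * 100

print(df)
```

Each assignment works on whole columns at once, so no explicit loop over rows is needed.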

  
1. Read data from the `by_manufacturer` sheet of the Covid data
2. Find the sum of total vaccinations for each manufacturer
3. Create a new column that holds the total vaccinations as a percentage of the overall sum of total vaccinations
4. Find the median percentage
5. Create a new column called `normalised_percentages` which duplicates the percentages column
6. Normalise the `normalised_percentages` column so that values greater than or equal to the median percentage become 1 and values less than the median become 0


**Test output**:

1.
2.
```
vaccine
Johnson&Johnson        264839828
Moderna               5548036383
Oxford/AstraZeneca     539433203
Pfizer/BioNTech       8690461304
Sinovac                604660293
Name: total_vaccinations, dtype: int64
```
3.
```
	location	date	vaccine	total_vaccinations	percentages
0	Chile	2020-12-24	Pfizer/BioNTech	420	0.000003
1	Chile	2020-12-25	Pfizer/BioNTech	5198	0.000033
2	Chile	2020-12-26	Pfizer/BioNTech	8338	0.000053
3	Chile	2020-12-27	Pfizer/BioNTech	8649	0.000055
4	Chile	2020-12-28	Pfizer/BioNTech	8649	0.000055
...	...	...	...	...	...
3291	United States	2021-05-01	Moderna	105947940	0.677095
3292	United States	2021-05-01	Pfizer/BioNTech	129013657	0.824504
3293	United States	2021-05-02	Johnson&Johnson	8374395	0.053519
3294	United States	2021-05-02	Moderna	106780082	0.682413
3295	United States	2021-05-02	Pfizer/BioNTech	130252779	0.832423
3296 rows × 5 columns
```
4. 0.0011110194374896931
5.
6.
```
	location	date	vaccine	total_vaccinations	percentages	normalise	normalised
0	Chile	2020-12-24	Pfizer/BioNTech	420	0.000003	0.000003	0
1	Chile	2020-12-25	Pfizer/BioNTech	5198	0.000033	0.000033	0
2	Chile	2020-12-26	Pfizer/BioNTech	8338	0.000053	0.000053	0
3	Chile	2020-12-27	Pfizer/BioNTech	8649	0.000055	0.000055	0
4	Chile	2020-12-28	Pfizer/BioNTech	8649	0.000055	0.000055	0
...	...	...	...	...	...	...	...
3291	United States	2021-05-01	Moderna	105947940	0.677095	0.677095	1
3292	United States	2021-05-01	Pfizer/BioNTech	129013657	0.824504	0.824504	1
3293	United States	2021-05-02	Johnson&Johnson	8374395	0.053519	0.053519	1
3294	United States	2021-05-02	Moderna	106780082	0.682413	0.682413	1
3295	United States	2021-05-02	Pfizer/BioNTech	130252779	0.832423	0.832423	1
3296 rows × 7 columns
```
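
A hedged sketch of steps 2 to 6 follows. It assumes a small made-up dataframe in place of the `by_manufacturer` sheet (step 1 would load the real one with `pd.read_excel(url, sheet_name="by_manufacturer")`), and uses the column names `percentages` and `normalised_percentages` from the task description.

```python
import pandas as pd

# Hypothetical sample standing in for the "by_manufacturer" sheet
df = pd.DataFrame({
    "location": ["Chile", "Chile", "United States", "United States"],
    "vaccine": ["Pfizer/BioNTech", "Sinovac", "Moderna", "Pfizer/BioNTech"],
    "total_vaccinations": [100, 300, 200, 400],
})

# 2. Sum of total vaccinations for each manufacturer
totals_by_vaccine = df.groupby("vaccine")["total_vaccinations"].sum()

# 3. Each row as a percentage of the overall sum of total vaccinations
overall_sum = df["total_vaccinations"].sum()
df["percentages"] = (df["total_vaccinations"] / overall_sum) * 100

# 4. Median percentage
median_pct = df["percentages"].median()

# 5. Duplicate the percentages column
df["normalised_percentages"] = df["percentages"]

# 6. Code the duplicate as 1 (>= median) or 0 (< median)
df["normalised_percentages"] = (
    df["normalised_percentages"] >= median_pct
).astype(int)

print(totals_by_vaccine)
print(df)
```

Keeping `percentages` untouched and normalising only the duplicate means the original values remain available for comparison alongside the 0/1 coding.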




# Reflection
---




## What skills have you demonstrated in completing this notebook?

Your answer:


## What caused you the most difficulty?

Your answer: