# TITLE...

# Table of Contents <a id='toc0'></a>  
- 1. [Introduction](#toc1) 
- 2. [Data Description](#toc2)  
  - 2.1. [Importing](#toc2_1)  
  - 2.2. [Cleaning](#toc2_2)  
  - 2.3. [Summary Statistics](#toc2_3)  
- 3. [Plotting / Correlations etc ...](#toc3)   
- 4. [Conclusion](#toc4)   

## 1. <a id='toc1'></a>[Introduction](#toc0)

...

## 2. <a id='toc2'></a>[Data Description](#toc0)

where we got data from etc.

### 2.1. <a id='toc2_1'></a>[Importing](#toc0)

In [1]:
import pandas as pd

import matplotlib.pyplot as plt
plt.rcParams.update({"axes.grid":True,"grid.color":"black","grid.alpha":"0.25","grid.linestyle":"--"})
plt.rcParams.update({'font.size': 14})

####importing datasets
import os 

# Using assert to check that paths exist on computer.
assert os.path.isdir('data/')
assert os.path.isfile('data/chicken-meat-production.xlsx')
assert os.path.isfile('data/ict-adoption-per-100-people.xlsx')

# Print everything in data
os.listdir('data/')


AssertionError: 
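The bare `AssertionError:` above does not say which path is missing. As a minimal sketch, assuming the same relative `data/` layout as in the cell above, the check could report the missing files by name. The helper `missing_paths` is a hypothetical name introduced here, not part of the notebook.

```python
from pathlib import Path

# Paths copied from the cell above; adjust if the data folder lives elsewhere.
REQUIRED = [
    "data/chicken-meat-production.xlsx",
    "data/ict-adoption-per-100-people.xlsx",
]

def missing_paths(paths):
    """Return the entries of `paths` that do not exist as files (hypothetical helper)."""
    return [p for p in paths if not Path(p).is_file()]

missing = missing_paths(REQUIRED)
if missing:
    # Printing the working directory helps when the notebook is started from the wrong folder.
    print("Missing files, relative to", Path.cwd(), ":", missing)
else:
    print("All required files found.")
```

If files are reported missing, the notebook is most likely being run from a folder that does not contain `data/`.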

### 2.2. <a id='toc2_2'></a>[Cleaning](#toc0)

Here, we clean up the data: we fix the columns and rows, remove unnecessary data points, and make the datasets usable in Python.

#### Chicken Production Data

In [None]:
filename1 = 'data/chicken-meat-production.xlsx' # open the file and have a look at it
pd.read_excel(filename1).head(10)

Unnamed: 0,chicken-meat-production,Unnamed: 1,Unnamed: 2,Unnamed: 3
0,Entity,Code,Year,"Meat, chicken | 00001058 || Production | 00551..."
1,Afghanistan,AFG,1961,5600
2,Afghanistan,AFG,1962,6000
3,Afghanistan,AFG,1963,6160
4,Afghanistan,AFG,1964,6400
5,Afghanistan,AFG,1965,6800
6,Afghanistan,AFG,1966,7200
7,Afghanistan,AFG,1967,7600
8,Afghanistan,AFG,1968,8000
9,Afghanistan,AFG,1969,9600


Here, we see a few issues with this data source: the headings are read as 'Unnamed' because the real column names sit in row 1, and the final column title is unreadable. Also, the 'Code' column is not needed: some entries are country codes, some are other codes, and many are blank.

In [None]:
#cleaning 'unnamed rows'
chicken_prod = pd.read_excel(filename1, skiprows=1)

#fixing title heading for tonnes of production
chicken_prod.rename(columns = {'Meat, chicken | 00001058 || Production | 005510 || tonnes':'Chicken Production (per tonne)'}, inplace=True)

#removing 'code' column
drop_these1 = 'Code'
print(drop_these1)

chicken_prod.drop(drop_these1, axis=1, inplace=True) # axis = 1 -> columns, inplace=True -> changed, no copy made

#clearer table
chicken_prod.head(5)


Code


Unnamed: 0,Entity,Year,Chicken Production (per tonne)
0,Afghanistan,1961,5600.0
1,Afghanistan,1962,6000.0
2,Afghanistan,1963,6160.0
3,Afghanistan,1964,6400.0
4,Afghanistan,1965,6800.0
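The same renaming and dropping can be written as a method chain that returns a new DataFrame instead of modifying one in place. This is only a sketch on a small hand-built frame that mimics the layout after `skiprows=1` (values copied from the preview above); `raw` and `cleaned` are illustrative names.

```python
import pandas as pd

LONG_NAME = "Meat, chicken | 00001058 || Production | 005510 || tonnes"

# Toy stand-in for the raw sheet; not read from the Excel file.
raw = pd.DataFrame({
    "Entity": ["Afghanistan", "Afghanistan"],
    "Code": ["AFG", "AFG"],
    "Year": [1961, 1962],
    LONG_NAME: [5600, 6000],
})

# rename() and drop() both return copies, so `raw` stays untouched.
cleaned = (
    raw
    .rename(columns={LONG_NAME: "Chicken Production (per tonne)"})
    .drop(columns="Code")
)
print(cleaned.columns.tolist())
```

Keeping `raw` unchanged makes it easier to re-run a single cell without reloading the file.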


There is also some data which we would like to remove and edit. 

**(1) Year names**: A variable whose name is a number can cause problems with some functions in Python, so we change every year from '[year]' to 'p[year]'.

**(2) Year range**: The technology adoption data covers 1960-2021, whereas the chicken production data covers 1961-2022. Therefore, 2022 is removed here.

**(3) Data parameters**: We are interested in regions, not individual countries, so we remove all entities that are countries. Since the 'Code' column has already been dropped, the selection below is made on the entity name.

In [None]:
#1. Renaming all year rows

#setting up new year names
year_renaming = {str(year): f"p{year}" for year in chicken_prod['Year'].unique()}

#replacing the dataset
chicken_prod['Year'] = chicken_prod['Year'].astype(str).replace(year_renaming)

chicken_prod.head(10)

Unnamed: 0,Entity,Year,Chicken Production (per tonne)
0,Afghanistan,p1961,5600.0
1,Afghanistan,p1962,6000.0
2,Afghanistan,p1963,6160.0
3,Afghanistan,p1964,6400.0
4,Afghanistan,p1965,6800.0
5,Afghanistan,p1966,7200.0
6,Afghanistan,p1967,7600.0
7,Afghanistan,p1968,8000.0
8,Afghanistan,p1969,9600.0
9,Afghanistan,p1970,9600.0


In [None]:
#2. Removing all rows for the year of 2022
chicken_prod = chicken_prod[chicken_prod['Year'] != 'p2022']
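Steps 1 and 2 can also be written without the renaming dictionary. A minimal sketch on a made-up frame (not the real data), assuming the years are stored as integers:

```python
import pandas as pd

# Made-up rows; only the Year column matters for this illustration.
toy = pd.DataFrame({"Entity": ["World"] * 3, "Year": [1961, 2021, 2022]})

# Step 1: prefix every year with "p" via string concatenation.
toy["Year"] = "p" + toy["Year"].astype(str)

# Step 2: drop the year that the other dataset does not cover.
toy = toy[toy["Year"] != "p2022"]

print(toy["Year"].tolist())
```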


In [None]:
#3. Dropping all other individual country data
# Build up a logical index I for all relevant data
I = chicken_prod.Entity.str.contains('Europe')
I |= chicken_prod.Entity.str.contains('European')
I |= chicken_prod.Entity.str.contains('World')
I |= chicken_prod.Entity.str.contains('Income')
I |= chicken_prod.Entity.str.contains('income')

# NOTE: the income groups (and others) can be deleted from the index above if they are not wanted.

# Keeping only the selected aggregates (I is already boolean, so no "== True" is needed)
chicken_prod = chicken_prod.loc[I]
chicken_prod.head(10)


Unnamed: 0,Entity,Year,Chicken Production (per tonne)
3595,Eastern Europe (FAO),p1961,1098572.0
3596,Eastern Europe (FAO),p1962,1119588.0
3597,Eastern Europe (FAO),p1963,1113712.0
3598,Eastern Europe (FAO),p1964,945150.0
3599,Eastern Europe (FAO),p1965,1039152.0
3600,Eastern Europe (FAO),p1966,1130102.0
3601,Eastern Europe (FAO),p1967,1176226.0
3602,Eastern Europe (FAO),p1968,1254153.0
3603,Eastern Europe (FAO),p1969,1363550.0
3604,Eastern Europe (FAO),p1970,1615866.0
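The five `str.contains` calls overlap ('Europe' already matches 'European', and 'Income'/'income' differ only in case). One possible shortening, shown on a made-up list of entity names taken from the outputs in this notebook, is a single case-insensitive pattern. Note that `case=False` may match slightly more than the original calls if the data contains other capitalisations.

```python
import pandas as pd

entities = pd.Series(
    [
        "Afghanistan",
        "Eastern Europe (FAO)",
        "European Union (27)",
        "High-income countries",
        "Low Income Food Deficit Countries (FAO)",
        "World",
    ],
    name="Entity",
)

# "|" means "or" in a regular expression; case=False covers Income/income.
keep = entities.str.contains("Europe|World|income", case=False, regex=True)
print(entities[keep].tolist())
```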


In [None]:
#final clean 

#resetting the index
chicken_prod.reset_index(inplace = True, drop = True) # Drop old index too

chicken_prod.head(10)


Unnamed: 0,Entity,Year,Chicken Production (per tonne)
0,Eastern Europe (FAO),p1961,1098572.0
1,Eastern Europe (FAO),p1962,1119588.0
2,Eastern Europe (FAO),p1963,1113712.0
3,Eastern Europe (FAO),p1964,945150.0
4,Eastern Europe (FAO),p1965,1039152.0
5,Eastern Europe (FAO),p1966,1130102.0
6,Eastern Europe (FAO),p1967,1176226.0
7,Eastern Europe (FAO),p1968,1254153.0
8,Eastern Europe (FAO),p1969,1363550.0
9,Eastern Europe (FAO),p1970,1615866.0


In [None]:
#final check that we have all the data needed
all_groups = chicken_prod['Entity'].unique()
print(all_groups)

all_years = chicken_prod['Year'].unique()
print(all_years)




['Eastern Europe (FAO)' 'Europe' 'Europe (FAO)' 'European Union (27)'
 'European Union (27) (FAO)' 'High-income countries'
 'Low Income Food Deficit Countries (FAO)' 'Low-income countries'
 'Lower-middle-income countries' 'Northern Europe (FAO)'
 'Southern Europe (FAO)' 'Upper-middle-income countries'
 'Western Europe (FAO)' 'World']
['p1961' 'p1962' 'p1963' 'p1964' 'p1965' 'p1966' 'p1967' 'p1968' 'p1969'
 'p1970' 'p1971' 'p1972' 'p1973' 'p1974' 'p1975' 'p1976' 'p1977' 'p1978'
 'p1979' 'p1980' 'p1981' 'p1982' 'p1983' 'p1984' 'p1985' 'p1986' 'p1987'
 'p1988' 'p1989' 'p1990' 'p1991' 'p1992' 'p1993' 'p1994' 'p1995' 'p1996'
 'p1997' 'p1998' 'p1999' 'p2000' 'p2001' 'p2002' 'p2003' 'p2004' 'p2005'
 'p2006' 'p2007' 'p2008' 'p2009' 'p2010' 'p2011' 'p2012' 'p2013' 'p2014'
 'p2015' 'p2016' 'p2017' 'p2018' 'p2019' 'p2020' 'p2021']


This looks good: no individual countries are listed, and 2022 has been removed.
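Printing the unique values shows which groups and years exist, but not whether every group covers every year. A small sketch of such a check on made-up data (the frame `toy` is illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({
    "Entity": ["World", "World", "Europe", "Europe"],
    "Year": ["p1961", "p1962", "p1961", "p1962"],
})

# Number of distinct years per entity, compared with the number of years overall.
years_per_entity = toy.groupby("Entity")["Year"].nunique()
incomplete = years_per_entity[years_per_entity < toy["Year"].nunique()]

print(incomplete.empty)  # True when no entity is missing a year
```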

#### Adoption of Technology Data

In [None]:
filename2 = 'data/ict-adoption-per-100-people.xlsx' # open the file and have a look at it
pd.read_excel(filename2).head(5)

Unnamed: 0,ict-adoption-per-100-people,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6
0,Entity,Code,Year,Fixed telephone subscriptions (per 100 people),Fixed broadband subscriptions (per 100 people),Mobile cellular subscriptions (per 100 people),Individuals using the Internet (% of population)
1,Afghanistan,AFG,1960,0.089302,,0,
2,Afghanistan,AFG,1961,0.085584,,,
3,Afghanistan,AFG,1962,0.085584,,,
4,Afghanistan,AFG,1963,0.085584,,,


Data issues: as with the chicken data, the headings are read as 'Unnamed' because the real column names sit in row 1.

In [None]:
#skipping row 1:
tech = pd.read_excel(filename2, skiprows=1)

#removing 'code' column
drop_these2 = 'Code'
print(drop_these2)

tech.drop(drop_these2, axis=1, inplace=True) # axis = 1 -> columns, inplace=True -> changed, no copy made

#clearer table
tech.head(5)

Code


Unnamed: 0,Entity,Year,Fixed telephone subscriptions (per 100 people),Fixed broadband subscriptions (per 100 people),Mobile cellular subscriptions (per 100 people),Individuals using the Internet (% of population)
0,Afghanistan,1960,0.089302,,0.0,
1,Afghanistan,1961,0.085584,,,
2,Afghanistan,1962,0.085584,,,
3,Afghanistan,1963,0.085584,,,
4,Afghanistan,1964,0.085584,,,


There is also some data which we would like to remove and edit. 

**(1) Year names**: A variable whose name is a number can cause problems with some functions in Python, so we change every year from '[year]' to 'a[year]'.

**(2) Year range**: The technology adoption data covers 1960-2021, whereas the chicken production data covers 1961-2022. Therefore, 1960 is removed from the technology data.

**(3) Data parameters**: We are interested in regions, not individual countries, so we remove all entities that are countries. Since the 'Code' column has already been dropped, the selection below is made on the entity name.

**(4) Missing values**: This dataset has missing values that we may want to remove. We have not yet decided which column to use, so the `dropna` cell below is left commented out: as written, it would remove every row that has a missing value in any column.

In [None]:
#1. Renaming all year rows

#setting up new year names
year_renaming = {str(year): f"a{year}" for year in tech['Year'].unique()}

#replacing the dataset
tech['Year'] = tech['Year'].astype(str).replace(year_renaming)

tech.head(10)

Unnamed: 0,Entity,Year,Fixed telephone subscriptions (per 100 people),Fixed broadband subscriptions (per 100 people),Mobile cellular subscriptions (per 100 people),Individuals using the Internet (% of population)
0,Afghanistan,a1960,0.089302,,0.0,
1,Afghanistan,a1961,0.085584,,,
2,Afghanistan,a1962,0.085584,,,
3,Afghanistan,a1963,0.085584,,,
4,Afghanistan,a1964,0.085584,,,
5,Afghanistan,a1965,0.097228,,0.0,
6,Afghanistan,a1966,0.093408,,,
7,Afghanistan,a1967,0.093408,,,
8,Afghanistan,a1968,0.093408,,,
9,Afghanistan,a1969,0.093408,,,


In [None]:
#2. Removing all rows for the year of 1960
tech = tech[tech['Year'] != 'a1960']

In [None]:
#3. Dropping all other individual country data
# Build up a logical index I for all relevant data
I = tech.Entity.str.contains('Europe')
I |= tech.Entity.str.contains('European')
I |= tech.Entity.str.contains('World')
I |= tech.Entity.str.contains('Income')
I |= tech.Entity.str.contains('income')

# NOTE: the income groups (and others) can be deleted from the index above if they are not wanted.

# Keeping only the selected aggregates (I is already boolean, so no "== True" is needed)
tech = tech.loc[I]
tech.head(10)


Unnamed: 0,Entity,Year,Fixed telephone subscriptions (per 100 people),Fixed broadband subscriptions (per 100 people),Mobile cellular subscriptions (per 100 people),Individuals using the Internet (% of population)
3663,Europe and Central Asia (WB),a1961,5.450759,,,
3664,Europe and Central Asia (WB),a1962,5.444708,,,
3665,Europe and Central Asia (WB),a1963,5.439382,,,
3666,Europe and Central Asia (WB),a1964,5.434491,,,
3667,Europe and Central Asia (WB),a1965,7.095055,,0.0,
3668,Europe and Central Asia (WB),a1966,7.087559,,,
3669,Europe and Central Asia (WB),a1967,7.079868,,,
3670,Europe and Central Asia (WB),a1968,7.071545,,,
3671,Europe and Central Asia (WB),a1969,7.062228,,,
3672,Europe and Central Asia (WB),a1970,9.847099,,0.0,


In [None]:
# 4. Drop rows with missing values. Denoted na
## tech.dropna(inplace=True)
## tech.head(10)
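A plain `dropna()` removes a row as soon as any column is missing, which would discard most early years here. One option, once a column has been chosen, is the `subset` argument. The sketch below uses made-up numbers (not values from the dataset) and picks the telephone column purely as an example.

```python
import numpy as np
import pandas as pd

TEL = "Fixed telephone subscriptions (per 100 people)"
NET = "Individuals using the Internet (% of population)"

# Made-up values for illustration only.
toy = pd.DataFrame({
    "Entity": ["World", "World", "World"],
    "Year": ["a1961", "a1990", "a2010"],
    TEL: [1.0, 2.0, 3.0],
    NET: [np.nan, 4.0, 5.0],
})

dropped_any = toy.dropna()                # drops the a1961 row (Internet is missing)
dropped_subset = toy.dropna(subset=[TEL])  # keeps all rows (telephone is complete)

print(len(dropped_any), len(dropped_subset))
```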


In [None]:
#final clean 

#resetting the index
tech.reset_index(inplace = True, drop = True) # Drop old index too



In [None]:
#final check that we have all the data needed
all_groups = tech['Entity'].unique()
print(all_groups)

all_years = tech['Year'].unique()
print(all_years)

['Europe and Central Asia (WB)' 'European Union (27)'
 'High-income countries' 'Low-income countries'
 'Lower-middle-income countries' 'Middle-income countries'
 'Upper-middle-income countries' 'World']
['a1961' 'a1962' 'a1963' 'a1964' 'a1965' 'a1966' 'a1967' 'a1968' 'a1969'
 'a1970' 'a1971' 'a1972' 'a1973' 'a1974' 'a1975' 'a1976' 'a1977' 'a1978'
 'a1979' 'a1980' 'a1981' 'a1982' 'a1983' 'a1984' 'a1985' 'a1986' 'a1987'
 'a1988' 'a1989' 'a1990' 'a1991' 'a1992' 'a1993' 'a1994' 'a1995' 'a1996'
 'a1997' 'a1998' 'a1999' 'a2000' 'a2001' 'a2002' 'a2003' 'a2004' 'a2005'
 'a2006' 'a2007' 'a2008' 'a2009' 'a2010' 'a2011' 'a2012' 'a2013' 'a2014'
 'a2015' 'a2016' 'a2017' 'a2018' 'a2019' 'a2020' 'a2021']
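The two printed group lists are not identical, which matters if the datasets are later combined by entity. The sketch below copies the names from the two outputs above and computes which groups appear in both:

```python
# Entity names copied from the printed outputs of the two "final check" cells.
chicken_groups = {
    "Eastern Europe (FAO)", "Europe", "Europe (FAO)", "European Union (27)",
    "European Union (27) (FAO)", "High-income countries",
    "Low Income Food Deficit Countries (FAO)", "Low-income countries",
    "Lower-middle-income countries", "Northern Europe (FAO)",
    "Southern Europe (FAO)", "Upper-middle-income countries",
    "Western Europe (FAO)", "World",
}
tech_groups = {
    "Europe and Central Asia (WB)", "European Union (27)",
    "High-income countries", "Low-income countries",
    "Lower-middle-income countries", "Middle-income countries",
    "Upper-middle-income countries", "World",
}

common = sorted(chicken_groups & tech_groups)          # in both datasets
only_chicken = sorted(chicken_groups - tech_groups)    # chicken data only
only_tech = sorted(tech_groups - chicken_groups)       # technology data only

print(common)
```

Only six groups are shared, so any entity-level comparison would be limited to those unless names are mapped by hand.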


### 2.3. <a id='toc2_3'></a>[Summary Statistics](#toc0)

giving summary statistics etc. 


* Long vs. wide data: it is not yet decided which format is needed.
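For reference, the cleaned tables are currently in long format (one row per entity and year). A minimal sketch of converting between the two formats with `pivot` and `melt`, using made-up values and an illustrative column name `value`:

```python
import pandas as pd

# Long format: one row per (Entity, Year) pair. Values are made up.
long_df = pd.DataFrame({
    "Entity": ["World", "World", "Europe", "Europe"],
    "Year": ["p1961", "p1962", "p1961", "p1962"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Wide format: one row per entity, one column per year.
wide_df = long_df.pivot(index="Entity", columns="Year", values="value")

# And back to long format again.
back_to_long = wide_df.reset_index().melt(
    id_vars="Entity", var_name="Year", value_name="value"
)

print(wide_df)
```

In the wide table the years become column names, which is where the 'p'/'a' prefixes from the cleaning step are useful.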

## 3. <a id='toc3'></a>[Plotting / Correlations etc...](#toc0)

In order to be able to **explore the raw data**, you may provide **static** and **interactive plots** to show important developments 

## 4. <a id='toc4'></a>[Conclusion](#toc0)