
# Data Science Unit 1 Sprint Challenge 2

## Data Wrangling

In this Sprint Challenge you will use data from [Gapminder](https://www.gapminder.org/about-gapminder/), a Swedish non-profit co-founded by Hans Rosling. "Gapminder produces free teaching resources making the world understandable based on reliable statistics."
- [Cell phones (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv)
- [Population (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv)
- [Geo country codes](https://github.com/open-numbers/ddf--gapminder--systema_globalis/blob/master/ddf--entities--geo--country.csv)

These two links have everything you need to successfully complete the Sprint Challenge!
- [Pandas documentation: Working with Text Data](https://pandas.pydata.org/pandas-docs/stable/text.html) (one question)
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) (everything else)

## Part 0. Load data

You don't need to add or change anything here. Just run this cell and it loads the data for you, into three dataframes.

In [0]:
import pandas as pd

cell_phones = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv')

population = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv')

geo_country_codes = (pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--entities--geo--country.csv')
                       .rename(columns={'country': 'geo', 'name': 'country'}))

## Part 1. Join data

First, join the `cell_phones` and `population` dataframes (with an inner join on `geo` and `time`).

The resulting dataframe's shape should be: (8590, 4)

**Looking at the shapes of `cell_phones` and `population`**

In [5]:
cell_phones.shape, population.shape

((9215, 3), (59297, 3))

**Checking out the Cell Columns to see which columns will work**

In [8]:
cell_phones.columns # I want geo and time

Index(['geo', 'time', 'cell_phones_total'], dtype='object')

**Checking out the Population Columns to see which columns will work**

In [9]:
population.columns # I want geo and time

Index(['geo', 'time', 'population_total'], dtype='object')

**Merging the `cell_phones` and `population` dataframes on 'geo' and 'time'**

In [0]:
population_and_cell_columns = ['geo', 'time']
cell_and_population_merge = pd.merge(cell_phones, population, 
                  how='inner', on=population_and_cell_columns)

**Checking that we have the correct shape**

In [36]:
cell_and_population_merge.shape

(8590, 4)
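
The row count drops from 9215 to 8590 because an inner join keeps only the `geo`/`time` pairs present in both dataframes. A minimal sketch of that behavior follows, using small made-up frames (the names `phones` and `people` and all their values are hypothetical, not Gapminder data). An outer merge with `indicator=True` is one way to see which side each unmatched row came from.

```python
import pandas as pd

# Hypothetical toy frames; the values are made up for illustration only.
phones = pd.DataFrame({
    'geo': ['aaa', 'aaa', 'bbb'],
    'time': [2000, 2001, 2000],
    'cell_phones_total': [10.0, 20.0, 5.0],
})
people = pd.DataFrame({
    'geo': ['aaa', 'aaa', 'ccc'],
    'time': [2000, 2002, 2000],
    'population_total': [100, 110, 50],
})

# Inner join: only the ('aaa', 2000) pair exists in both frames.
inner = pd.merge(phones, people, how='inner', on=['geo', 'time'])

# Outer join with indicator=True adds a '_merge' column that labels each
# row as 'both', 'left_only', or 'right_only'.
outer = pd.merge(phones, people, how='outer', on=['geo', 'time'], indicator=True)
match_counts = outer['_merge'].value_counts()

print(inner.shape)
print(match_counts)
```

Running the same outer merge on the real `cell_phones` and `population` dataframes would show how many rows the inner join discards from each side.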

Then, select the `geo` and `country` columns from the `geo_country_codes` dataframe, and join with your population and cell phone data.

The resulting dataframe's shape should be: (8590, 5)

**Looking at the Geo Country Codes Columns**

In [38]:
geo_country_codes.columns

Index(['geo', 'alt_5', 'alternative_1', 'alternative_2', 'alternative_3',
       'alternative_4_cdiac', 'arb1', 'arb2', 'arb3', 'arb4', 'arb5', 'arb6',
       'g77_and_oecd_countries', 'gapminder_list', 'god_id', 'gwid',
       'income_groups', 'is--country', 'iso3166_1_alpha2', 'iso3166_1_alpha3',
       'iso3166_1_numeric', 'iso3166_2', 'landlocked', 'latitude', 'longitude',
       'main_religion_2008', 'country', 'pandg', 'un_state',
       'unicode_region_subtag', 'upper_case_name', 'world_4region',
       'world_6region'],
      dtype='object')

**Merging the 'geo' and 'country' columns into 'cell_and_population_merge'**

In [0]:
geo_country_columns = ['geo', 'country']

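# 'geo' is the only shared column, so naming it explicitly matches the default behavior
final_merge = pd.merge(cell_and_population_merge, geo_country_codes[geo_country_columns],
                       how='inner', on='geo')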


In [178]:
final_merge.shape

(8590, 5)
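
The row count stays at 8590 because each `geo` code should map to exactly one country name. A hedged sketch of how that assumption could be checked: the `validate='many_to_one'` argument of `pd.merge` raises an error if the lookup side has duplicate keys. The frames `facts` and `lookup` below are hypothetical toy data.

```python
import pandas as pd

# Hypothetical toy data: many fact rows per geo code, one lookup row per geo code.
facts = pd.DataFrame({
    'geo': ['aaa', 'aaa', 'bbb'],
    'time': [2000, 2001, 2000],
    'cell_phones_total': [10.0, 20.0, 5.0],
})
lookup = pd.DataFrame({
    'geo': ['aaa', 'bbb', 'ccc'],
    'country': ['Country A', 'Country B', 'Country C'],
    'landlocked': ['coastline', 'landlocked', 'coastline'],
})

# Select only the two needed columns, then merge.
# validate='many_to_one' raises pandas.errors.MergeError if 'geo' is
# duplicated in the lookup frame, which would otherwise multiply rows.
labeled = pd.merge(facts, lookup[['geo', 'country']],
                   how='inner', on='geo', validate='many_to_one')

print(labeled)
```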

## Part 2. Make features

Calculate the number of cell phones per person, and add this column onto your dataframe.

(You've calculated correctly if you get 1.220 cell phones per person in the United States in 2017.)

In [179]:
final_merge.columns

Index(['geo', 'time', 'cell_phones_total', 'population_total', 'country'], dtype='object')

**Calculating the number of cell phones per person by dividing 'cell_phones_total' by 'population_total'**

In [0]:
final_merge['cell_phones_per_person'] = final_merge['cell_phones_total'] / final_merge['population_total']

**I got it right, as shown in row 8134**

In [191]:
final_merge[ final_merge['country'].str.startswith('United States')].tail(1)

|  | geo | time | cell_phones_total | population_total | country | cell_phones_per_person |
| --- | --- | --- | --- | --- | --- | --- |
| 8134 | usa | 2017 | 395881000.0 | 324459463 | United States | 1.2201246847283354 |
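
As a small sketch, the same check can be made by filtering on `geo` and `time` directly rather than relying on the last row for a country. The single row below is copied from the output above; the variable names `row` and `usa_2017` are illustrative.

```python
import pandas as pd

# One row copied from the output above (United States, 2017).
row = pd.DataFrame({
    'geo': ['usa'],
    'time': [2017],
    'cell_phones_total': [395881000.0],
    'population_total': [324459463],
})

# Element-wise division of two columns gives phones per person.
row['cell_phones_per_person'] = row['cell_phones_total'] / row['population_total']

# Select the value by geo code and year instead of by row position.
is_usa_2017 = (row['geo'] == 'usa') & (row['time'] == 2017)
usa_2017 = row.loc[is_usa_2017, 'cell_phones_per_person'].iloc[0]

print(round(usa_2017, 3))
```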


Modify the `geo` column to make the geo codes uppercase instead of lowercase.

**Converting lowercase to uppercase**

In [0]:
final_merge['geo'] = final_merge['geo'].str.upper()

**Checking it worked**

In [193]:
final_merge.head()

|  | geo | time | cell_phones_total | population_total | country | cell_phones_per_person |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | AFG | 1960 | 0.0 | 8996351 | Afghanistan | 0.0 |
| 1 | AFG | 1965 | 0.0 | 9938414 | Afghanistan | 0.0 |
| 2 | AFG | 1970 | 0.0 | 11126123 | Afghanistan | 0.0 |
| 3 | AFG | 1975 | 0.0 | 12590286 | Afghanistan | 0.0 |
| 4 | AFG | 1976 | 0.0 | 12840299 | Afghanistan | 0.0 |


## Part 3. Process data

Use the `describe` function to describe your dataframe's numeric columns, and then its non-numeric columns.

(You'll see the time period ranges from 1960 to 2017, and there are 195 unique countries represented.)

**Numeric Describe:**

In [194]:
import numpy as np
final_merge.describe(include=[np.number])

|  | time | cell_phones_total | population_total | cell_phones_per_person |
| --- | --- | --- | --- | --- |
| count | 8590.0 | 8590.0 | 8590.0 | 8590.0 |
| mean | 1994.1934807916184 | 9004949.642905472 | 29838230.581722934 | 0.2796385558059151 |
| std | 14.257974607310302 | 55734084.87217964 | 116128377.474773 | 0.454246656214052 |
| min | 1960.0 | 0.0 | 4433.0 | 0.0 |
| 25% | 1983.0 | 0.0 | 1456148.0 | 0.0 |
| 50% | 1995.0 | 6200.0 | 5725062.5 | 0.0015636266438163 |
| 75% | 2006.0 | 1697652.0 | 18105812.0 | 0.4611491855201403 |
| max | 2017.0 | 1474097000.0 | 1409517397.0 | 2.490242818521353 |


**Non Numeric Describe:**

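The numbers below are counts, not data values: `count` is the number of rows, `unique` is the number of distinct values, and `freq` is how often the most common value (`top`) appears.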

In [198]:
final_merge.describe(exclude=[np.number])

|  | geo | country |
| --- | --- | --- |
| count | 8590 | 8590 |
| unique | 195 | 195 |
| top | MCO | Guyana |
| freq | 46 | 46 |
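
To make the difference between the two `describe` calls concrete, here is a minimal sketch on a hypothetical three-row frame (the name `toy` and its values are made up). Numeric columns get distribution statistics, while non-numeric columns get `count`, `unique`, `top`, and `freq`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one text column and two numeric columns.
toy = pd.DataFrame({
    'geo': ['AAA', 'AAA', 'BBB'],
    'time': [2000, 2001, 2000],
    'cell_phones_total': [10.0, 20.0, 5.0],
})

# Numeric columns only: count, mean, std, min, quartiles, max.
numeric_summary = toy.describe(include=[np.number])

# Everything except numeric columns: count, unique, top, freq.
text_summary = toy.describe(exclude=[np.number])

print(numeric_summary)
print(text_summary)
```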


In 2017, what were the top 5 countries with the most cell phones total?

Your list of countries should have these totals:

| country | cell phones total |
|:-------:|:-----------------:|
|    ?    |     1,474,097,000 |
|    ?    |     1,168,902,277 |
|    ?    |       458,923,202 |
|    ?    |       395,881,000 |
|    ?    |       236,488,548 |



In [0]:
# This optional code formats float numbers with comma separators
pd.options.display.float_format = '{:,}'.format

In [199]:
filtering_2017 = final_merge[final_merge['time'] == 2017]
# Every remaining row is from 2017, so sorting on cell_phones_total alone is enough
countries_most_cell_phones = filtering_2017.sort_values(by='cell_phones_total', ascending=False).head()
countries_most_cell_phones

|  | geo | time | cell_phones_total | population_total | country | cell_phones_per_person |
| --- | --- | --- | --- | --- | --- | --- |
| 1496 | CHN | 2017 | 1474097000.0 | 1409517397 | China | 1.0458168186766978 |
| 3595 | IND | 2017 | 1168902277.0 | 1339180127 | India | 0.8728491809526382 |
| 3549 | IDN | 2017 | 458923202.0 | 263991379 | Indonesia | 1.738402230172827 |
| 8134 | USA | 2017 | 395881000.0 | 324459463 | United States | 1.2201246847283354 |
| 1084 | BRA | 2017 | 236488548.0 | 209288278 | Brazil | 1.1299655683535224 |


**The top 5 countries with the most cell phones in 2017:**

1.   China
2.   India
3.   Indonesia
4.   USA
5.   Brazil
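
An alternative to sorting and taking the head is `DataFrame.nlargest`, which returns the top rows for a column in one call. The sketch below uses the five 2017 totals copied from the output above and asks for the top three; the names `rows_2017` and `top3` are illustrative.

```python
import pandas as pd

# 2017 totals copied from the output above, deliberately listed out of order.
rows_2017 = pd.DataFrame({
    'country': ['Brazil', 'China', 'India', 'Indonesia', 'United States'],
    'cell_phones_total': [236488548.0, 1474097000.0, 1168902277.0,
                          458923202.0, 395881000.0],
})

# nlargest(n, column) returns the n rows with the largest values, sorted descending.
top3 = rows_2017.nlargest(3, 'cell_phones_total')

print(top3['country'].tolist())
```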



2017 was the first year that China had more cell phones than people.

What was the first year that the USA had more cell phones than people?

In [202]:
specifying_country = final_merge[final_merge['country'] == 'United States']

# Keep only the years with more cell phones than people
usa_cell_phones_vs_ppl = specifying_country[specifying_country['cell_phones_total'] > specifying_country['population_total']]

# The earliest of those years is the answer
usa_cell_phones_vs_ppl.sort_values(by='time').head(1)

|  | geo | time | cell_phones_total | population_total | country | cell_phones_per_person |
| --- | --- | --- | --- | --- | --- | --- |
| 8131 | USA | 2014 | 355500000.0 | 317718779 | United States | 1.118914031833164 |


**2014 was the first year the USA had more cell phones than people**
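
The same question can be asked for several countries at once. One possible approach, sketched here under stated assumptions, is to filter to the rows where phones exceed population and take the minimum year per country with `groupby`. The Brazil rows are copied from this notebook's data; "Country X" and its values are hypothetical, and the result only reflects the rows included in this toy frame.

```python
import pandas as pd

toy = pd.DataFrame({
    'country': ['Brazil'] * 3 + ['Country X'] * 3,
    'time': [2009, 2010, 2011] * 2,
    # Brazil values come from the notebook; Country X values are made up.
    'cell_phones_total': [169385584.0, 196929978.0, 234357507.0, 50.0, 90.0, 120.0],
    'population_total': [194895996, 196796269, 198686688, 100, 100, 100],
})

# Keep only the country-years with more phones than people.
more_phones = toy[toy['cell_phones_total'] > toy['population_total']]

# The earliest remaining year per country is the first crossover year.
first_year = more_phones.groupby('country')['time'].min()

print(first_year)
```

Countries that never cross the threshold simply do not appear in the result.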

## Part 4. Reshape data

Create a pivot table:
- Columns: Years 2007—2017
- Rows: China, India, United States, Indonesia, Brazil (order doesn't matter)
- Values: Cell Phones Total

The table's shape should be: (5, 11)

**Looking at Pivot Table as is**

In [203]:
final_merge.pivot_table(index='country', columns='time', values='cell_phones_total').head() # Initial PT

| country | 1960 | 1965 | 1970 | 1975 | 1976 | 1977 | 1978 | 1979 | 1980 | 1981 | ... | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Afghanistan | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 7898909.0 | 10500000.0 | 10215840.0 | 13797879.0 | 15340115.0 | 16807156.0 | 18407168.0 | 19709038.0 | 21602982.0 | 23929713.0 |
| Albania | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1859632.0 | 2463741.0 | 2692372.0 | 3100000.0 | 3500000.0 | 3685983.0 | 3359654.0 | 3400955.0 | 3369756.0 | 3497950.0 |
| Algeria | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 27031472.0 | 32729824.0 | 32780165.0 | 35615926.0 | 37527703.0 | 39517045.0 | 43298174.0 | 43227643.0 | 47041321.0 | 49873389.0 |
| Andorra | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 64202.0 | 64549.0 | 65495.0 | 65044.0 | 63865.0 | 63931.0 | 66241.0 | 71336.0 | 76132.0 | 80337.0 |
| Angola | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 6773356.0 | 8109421.0 | 9403365.0 | 12073218.0 | 12785109.0 | 13285198.0 | 14052558.0 | 13884532.0 | 13001124.0 | 13323952.0 |


**Setting the new columns, rows and values**

In [0]:
years = list(range(2007, 2018))  # 2007 through 2017 inclusive
countries = ['Brazil','China','India','Indonesia','United States']
# Combine both conditions into a single boolean mask
new_list = final_merge[final_merge['country'].isin(countries) & final_merge['time'].isin(years)]


**It worked**

In [206]:
new_list.head()

|  | geo | time | cell_phones_total | population_total | country | cell_phones_per_person |
| --- | --- | --- | --- | --- | --- | --- |
| 1074 | BRA | 2007 | 120980103.0 | 191026637 | Brazil | 0.6333153580042348 |
| 1075 | BRA | 2008 | 150641403.0 | 192979029 | Brazil | 0.7806102237150339 |
| 1076 | BRA | 2009 | 169385584.0 | 194895996 | Brazil | 0.869107562373934 |
| 1077 | BRA | 2010 | 196929978.0 | 196796269 | Brazil | 1.000679428531239 |
| 1078 | BRA | 2011 | 234357507.0 | 198686688 | Brazil | 1.1795330092774006 |


**Pivoting**

In [172]:
grand_pivot = new_list.pivot_table(index = 'country', columns = 'time', values = 'cell_phones_total')
grand_pivot.head()

| country | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Brazil | 120980103.0 | 150641403.0 | 169385584.0 | 196929978.0 | 234357507.0 | 248323703.0 | 271099799.0 | 280728796.0 | 257814274.0 | 244067356.0 | 236488548.0 |
| China | 547306000.0 | 641245000.0 | 747214000.0 | 859003000.0 | 986253000.0 | 1112155000.0 | 1229113000.0 | 1286093000.0 | 1291984200.0 | 1364934000.0 | 1474097000.0 |
| India | 233620000.0 | 346890000.0 | 525090000.0 | 752190000.0 | 893862478.0 | 864720917.0 | 886304245.0 | 944008677.0 | 1001056000.0 | 1127809000.0 | 1168902277.0 |
| Indonesia | 93386881.0 | 140578243.0 | 163676961.0 | 211290235.0 | 249805619.0 | 281963665.0 | 313226914.0 | 325582819.0 | 338948340.0 | 385573398.0 | 458923202.0 |
| United States | 249300000.0 | 261300000.0 | 274283000.0 | 285118000.0 | 297404000.0 | 304838000.0 | 310698000.0 | 355500000.0 | 382307000.0 | 395881000.0 | 395881000.0 |


**Confirming Shape**

In [173]:
grand_pivot.shape

(5, 11)
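
One design point worth knowing: `pivot_table` aggregates duplicate index/column pairs (by mean, unless told otherwise), while `pivot` refuses to reshape them. The sketch below illustrates the difference on hypothetical toy data; the names `long_form`, `with_dup`, and all values are made up.

```python
import pandas as pd

# Hypothetical long-form data: one row per country and year.
long_form = pd.DataFrame({
    'country': ['Country A', 'Country A', 'Country B', 'Country B'],
    'time': [2007, 2008, 2007, 2008],
    'cell_phones_total': [1.0, 2.0, 3.0, 4.0],
})

# With unique country/year pairs, pivot reshapes without aggregating.
wide = long_form.pivot(index='country', columns='time', values='cell_phones_total')

# Add a second (Country A, 2007) row to create a duplicate pair.
duplicate_row = long_form.iloc[[0]].assign(cell_phones_total=5.0)
with_dup = pd.concat([long_form, duplicate_row], ignore_index=True)

# pivot_table averages the duplicates by default: (1.0 + 5.0) / 2.
averaged = with_dup.pivot_table(index='country', columns='time',
                                values='cell_phones_total')

# pivot raises ValueError when an index/column pair is duplicated.
try:
    with_dup.pivot(index='country', columns='time', values='cell_phones_total')
    pivot_raised = False
except ValueError:
    pivot_raised = True

print(wide.shape, averaged.loc['Country A', 2007], pivot_raised)
```

In this notebook each country/year pair appears once after the merges, so both approaches should give the same table; `pivot` simply fails loudly if that assumption ever breaks.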

#### OPTIONAL BONUS QUESTION!

Sort these 5 countries, by biggest increase in cell phones from 2007 to 2017.

Which country had 935,282,277 more cell phones in 2017 versus 2007?

**Adding a column with the difference between 2017 and 2007**

In [175]:
grand_pivot['difference'] = grand_pivot[2017]- grand_pivot[2007]
grand_pivot.head()

| country | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | cell_difference | difference |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Brazil | 120980103.0 | 150641403.0 | 169385584.0 | 196929978.0 | 234357507.0 | 248323703.0 | 271099799.0 | 280728796.0 | 257814274.0 | 244067356.0 | 236488548.0 | 115508445.0 | 115508445.0 |
| China | 547306000.0 | 641245000.0 | 747214000.0 | 859003000.0 | 986253000.0 | 1112155000.0 | 1229113000.0 | 1286093000.0 | 1291984200.0 | 1364934000.0 | 1474097000.0 | 926791000.0 | 926791000.0 |
| India | 233620000.0 | 346890000.0 | 525090000.0 | 752190000.0 | 893862478.0 | 864720917.0 | 886304245.0 | 944008677.0 | 1001056000.0 | 1127809000.0 | 1168902277.0 | 935282277.0 | 935282277.0 |
| Indonesia | 93386881.0 | 140578243.0 | 163676961.0 | 211290235.0 | 249805619.0 | 281963665.0 | 313226914.0 | 325582819.0 | 338948340.0 | 385573398.0 | 458923202.0 | 365536321.0 | 365536321.0 |
| United States | 249300000.0 | 261300000.0 | 274283000.0 | 285118000.0 | 297404000.0 | 304838000.0 | 310698000.0 | 355500000.0 | 382307000.0 | 395881000.0 | 395881000.0 | 146581000.0 | 146581000.0 |


**Listing**

In [176]:
grand_pivot['difference'].sort_values(ascending=False)

country
India           935,282,277.0
China           926,791,000.0
Indonesia       365,536,321.0
United States   146,581,000.0
Brazil          115,508,445.0
Name: difference, dtype: float64

**India had the biggest increase: 935,282,277 more cell phones in 2017 than in 2007**
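
As a hedged alternative, the difference can be computed as a standalone Series rather than as a new column, which leaves the pivot table at its (5, 11) shape. The sketch below rebuilds just the 2007 and 2017 columns from the values shown in the pivot table above; the names `totals`, `increase`, and `biggest` are illustrative.

```python
import pandas as pd

# 2007 and 2017 totals copied from the pivot table above.
totals = pd.DataFrame(
    {
        2007: [120980103.0, 547306000.0, 233620000.0, 93386881.0, 249300000.0],
        2017: [236488548.0, 1474097000.0, 1168902277.0, 458923202.0, 395881000.0],
    },
    index=pd.Index(['Brazil', 'China', 'India', 'Indonesia', 'United States'],
                   name='country'),
)

# Subtracting two columns gives a Series indexed by country.
increase = (totals[2017] - totals[2007]).sort_values(ascending=False)

# idxmax returns the index label (country) of the largest increase.
biggest = increase.idxmax()

print(increase)
print(biggest)
```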

If you have the time and curiosity, what other questions can you ask and answer with this data?