<a href="https://colab.research.google.com/github/will-cotton4/DS-Unit-1-Sprint-2-Data-Wrangling/blob/master/(Will_Cotton)_DS_Unit_1_Sprint_Challenge_2_Data_Wrangling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science Unit 1 Sprint Challenge 2

## Data Wrangling

In this Sprint Challenge you will use data from [Gapminder](https://www.gapminder.org/about-gapminder/), a Swedish non-profit co-founded by Hans Rosling. "Gapminder produces free teaching resources making the world understandable based on reliable statistics."
- [Cell phones (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv)
- [Population (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv)
- [Geo country codes](https://github.com/open-numbers/ddf--gapminder--systema_globalis/blob/master/ddf--entities--geo--country.csv)

These two links have everything you need to successfully complete the Sprint Challenge!
- [Pandas documentation: Working with Text Data](https://pandas.pydata.org/pandas-docs/stable/text.html) (one question)
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) (everything else)

## Part 0. Load data

You don't need to add or change anything here. Just run this cell to load the data into three dataframes.

In [0]:
import pandas as pd

cell_phones = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv')

population = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv')

geo_country_codes = (pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--entities--geo--country.csv')
                       .rename(columns={'country': 'geo', 'name': 'country'}))

In [0]:
geo_country_codes.columns

Index(['geo', 'alt_5', 'alternative_1', 'alternative_2', 'alternative_3',
       'alternative_4_cdiac', 'arb1', 'arb2', 'arb3', 'arb4', 'arb5', 'arb6',
       'g77_and_oecd_countries', 'gapminder_list', 'god_id', 'gwid',
       'income_groups', 'is--country', 'iso3166_1_alpha2', 'iso3166_1_alpha3',
       'iso3166_1_numeric', 'iso3166_2', 'landlocked', 'latitude', 'longitude',
       'main_religion_2008', 'country', 'pandg', 'un_state',
       'unicode_region_subtag', 'upper_case_name', 'world_4region',
       'world_6region'],
      dtype='object')

In [0]:
cell_phones.head()

Unnamed: 0,geo,time,cell_phones_total
0,abw,1960,0.0
1,abw,1965,0.0
2,abw,1970,0.0
3,abw,1975,0.0
4,abw,1976,0.0


In [0]:
population.head()

Unnamed: 0,geo,time,population_total
0,afg,1800,3280000
1,afg,1801,3280000
2,afg,1802,3280000
3,afg,1803,3280000
4,afg,1804,3280000


In [0]:
geo_country_codes.head()

Unnamed: 0,geo,alt_5,alternative_1,alternative_2,alternative_3,alternative_4_cdiac,arb1,arb2,arb3,arb4,...,latitude,longitude,main_religion_2008,country,pandg,un_state,unicode_region_subtag,upper_case_name,world_4region,world_6region
0,abkh,,,,,,,,,,...,,,,Abkhazia,,False,,,europe,europe_central_asia
1,abw,,,,,Aruba,,,,,...,12.5,-69.96667,christian,Aruba,,False,AW,ARUBA,americas,america
2,afg,,Islamic Republic of Afghanistan,,,Afghanistan,,,,,...,33.0,66.0,muslim,Afghanistan,AFGHANISTAN,True,AF,AFGHANISTAN,asia,south_asia
3,ago,,,,,Angola,,,,,...,-12.5,18.5,christian,Angola,ANGOLA,True,AO,ANGOLA,africa,sub_saharan_africa
4,aia,,,,,,,,,,...,18.21667,-63.05,christian,Anguilla,,False,AI,ANGUILLA,americas,america


## Part 1. Join data

First, join the `cell_phones` and `population` dataframes (with an inner join on `geo` and `time`).

The resulting dataframe's shape should be: (8590, 4)

In [0]:
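# Name the join keys explicitly rather than relying on pandas to infer the shared columns
cell_user_data = pd.merge(cell_phones, population, how='inner', on=['geo', 'time'])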
cell_user_data.shape

(8590, 4)
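
The same inner join can be sketched on a pair of tiny frames. The values below are made up for illustration (they are not Gapminder data), and the `_toy` names are hypothetical. The optional `validate` argument asks pandas to raise an error if a `geo`/`time` pair appears more than once on either side.

```python
import pandas as pd

# Toy stand-ins for the real dataframes; all values are invented.
cell_phones_toy = pd.DataFrame({
    'geo': ['aaa', 'aaa', 'bbb'],
    'time': [2016, 2017, 2017],
    'cell_phones_total': [10.0, 12.0, 5.0],
})
population_toy = pd.DataFrame({
    'geo': ['aaa', 'bbb', 'ccc'],
    'time': [2017, 2017, 2017],
    'population_total': [100, 50, 70],
})

# Only key pairs present in both frames survive an inner join.
joined_toy = pd.merge(cell_phones_toy, population_toy,
                      how='inner', on=['geo', 'time'],
                      validate='one_to_one')
print(joined_toy.shape)  # (2, 4)
```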

Then, select the `geo` and `country` columns from the `geo_country_codes` dataframe, and join with your population and cell phone data.

The resulting dataframe's shape should be: (8590, 5)

In [0]:
cell_user_data = pd.merge(cell_user_data,
                          geo_country_codes[['geo', 'country']],
                          how='inner',
                          on='geo')
cell_user_data.shape

(8590, 5)
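
The row count staying at 8590 suggests every `geo` code found a matching country name. One way to check this directly, sketched here on made-up toy frames with hypothetical names, is a left join with `indicator=True`, which adds a `_merge` column marking rows that had no match.

```python
import pandas as pd

# Toy frames; 'zzz' deliberately has no entry in the code table.
data_toy = pd.DataFrame({'geo': ['aaa', 'bbb', 'zzz'],
                         'time': [2017, 2017, 2017]})
codes_toy = pd.DataFrame({'geo': ['aaa', 'bbb'],
                          'country': ['Country A', 'Country B']})

# A left join keeps every row of data_toy; _merge records where each row came from.
checked = pd.merge(data_toy, codes_toy, how='left', on='geo', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'geo'].tolist()
print(unmatched)  # ['zzz']
```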

## Part 2. Make features

Calculate the number of cell phones per person, and add this column onto your dataframe.

(You've calculated correctly if you get 1.220 cell phones per person in the United States in 2017.)

In [0]:
cell_user_data['cells_per_person'] = cell_user_data['cell_phones_total']/cell_user_data['population_total']

In [0]:
cell_user_data[cell_user_data['country']=='United States']

Unnamed: 0,geo,time,cell_phones_total,population_total,country,cells_per_person
8092,usa,1960,0.0,186808228,United States,0.0
8093,usa,1965,0.0,199815540,United States,0.0
8094,usa,1970,0.0,209588150,United States,0.0
8095,usa,1975,0.0,219205296,United States,0.0
8096,usa,1976,0.0,221239215,United States,0.0
8097,usa,1977,0.0,223324042,United States,0.0
8098,usa,1978,0.0,225449657,United States,0.0
8099,usa,1979,0.0,227599878,United States,0.0
8100,usa,1980,0.0,229763052,United States,0.0
8101,usa,1984,91600.0,238573861,United States,0.000384


**Looks good.**
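
The output above stops before 2017, so the 1.220 check can also be done by filtering on both country and year. This is a small sketch using the single United States 2017 row that appears in a later output of this notebook; the variable names are hypothetical.

```python
import pandas as pd

# One row copied from this notebook's output for the United States in 2017.
us_rows = pd.DataFrame({
    'country': ['United States'],
    'time': [2017],
    'cell_phones_total': [395881000.0],
    'population_total': [324459463],
})
us_rows['cells_per_person'] = us_rows['cell_phones_total'] / us_rows['population_total']

# Combine both conditions with & and pull out the single matching value.
mask = (us_rows['country'] == 'United States') & (us_rows['time'] == 2017)
ratio = us_rows.loc[mask, 'cells_per_person'].iloc[0]
print(round(ratio, 3))  # 1.22
```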


Modify the `geo` column to make the geo codes uppercase instead of lowercase.

In [0]:
cell_user_data['geo'] = cell_user_data['geo'].str.upper()
cell_user_data['geo'].head()

0    AFG
1    AFG
2    AFG
3    AFG
4    AFG
Name: geo, dtype: object

**Nice.**
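
The `.str` accessor from the "Working with Text Data" page gives the same result as `apply(str.upper)` on clean data, but the two differ when a value is missing. A minimal sketch on a made-up series (the `geo` column here may well have no missing values):

```python
import numpy as np
import pandas as pd

# Made-up codes with one missing entry.
codes = pd.Series(['afg', np.nan, 'usa'], dtype=object)

# The accessor passes the missing value through untouched.
upper_codes = codes.str.upper()

# apply(str.upper) calls str.upper on the float NaN, which raises TypeError.
try:
    codes.apply(str.upper)
    apply_failed = False
except TypeError:
    apply_failed = True

print(upper_codes.tolist()[0], apply_failed)
```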

## Part 3. Process data

Use the `describe` function to describe your dataframe's numeric columns, and then its non-numeric columns.

(You'll see the time period ranges from 1960 to 2017, and there are 195 unique countries represented.)

In [0]:
import numpy as np
cell_user_data.describe(include=[np.number])

Unnamed: 0,time,cell_phones_total,population_total,cells_per_person
count,8590.0,8590.0,8590.0,8590.0
mean,1994.193481,9004950.0,29838230.0,0.279639
std,14.257975,55734080.0,116128400.0,0.454247
min,1960.0,0.0,4433.0,0.0
25%,1983.0,0.0,1456148.0,0.0
50%,1995.0,6200.0,5725062.0,0.001564
75%,2006.0,1697652.0,18105810.0,0.461149
max,2017.0,1474097000.0,1409517000.0,2.490243


In [0]:
# The np.object alias was removed in NumPy 1.24; the builtin object selects the same columns
cell_user_data.describe(include=[object])

Unnamed: 0,geo,country
count,8590,8590
unique,195,195
top,MNG,Cambodia
freq,46,46


In 2017, what were the top 5 countries with the most cell phones total?

Your list of countries should have these totals:

| country | cell phones total |
|:-------:|:-----------------:|
|    ?    |     1,474,097,000 |
|    ?    |     1,168,902,277 |
|    ?    |       458,923,202 |
|    ?    |       395,881,000 |
|    ?    |       236,488,548 |



In [0]:
# This optional code formats float numbers with comma separators
pd.options.display.float_format = '{:,}'.format

In [0]:
in_2017 = cell_user_data['time'] == 2017
cell_groups = cell_user_data.loc[in_2017, ['country', 'cell_phones_total', 'time']].groupby(by='country')

cell_groups.sum().reset_index().sort_values('cell_phones_total', ascending=False).head()

Unnamed: 0,country,cell_phones_total,time
31,China,1474097000.0,2017
67,India,1168902277.0,2017
68,Indonesia,458923202.0,2017
160,United States,395881000.0,2017
21,Brazil,236488548.0,2017
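
Since each country has a single row per year, the same top 5 can also be found without grouping. This sketch uses `DataFrame.nlargest` on a handful of 2017 totals copied from outputs in this notebook; the frame name is hypothetical.

```python
import pandas as pd

# 2017 totals copied from this notebook's outputs (seven countries only).
totals_2017 = pd.DataFrame({
    'country': ['Argentina', 'Brazil', 'Canada', 'China',
                'India', 'Indonesia', 'United States'],
    'cell_phones_total': [61897379.0, 236488548.0, 31458600.0, 1474097000.0,
                          1168902277.0, 458923202.0, 395881000.0],
})

# nlargest sorts descending on the given column and keeps the first n rows.
top5 = totals_2017.nlargest(5, 'cell_phones_total')
print(top5['country'].tolist())
# ['China', 'India', 'Indonesia', 'United States', 'Brazil']
```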


2017 was the first year that China had more cell phones than people.

What was the first year that the USA had more cell phones than people?

**First, make a dataframe containing only the rows where the number of cell phones exceeds the number of people:**

In [0]:
condition = cell_user_data['cell_phones_total'] > cell_user_data['population_total']
more_phones_than_people = cell_user_data[condition]

In [0]:
more_phones_than_people.shape

(1044, 6)

In [0]:
more_phones_than_people.head()

Unnamed: 0,geo,time,cell_phones_total,population_total,country,cells_per_person
131,ALB,2011,3100000.0,2926659,Albania,1.0592282872722787
132,ALB,2012,3500000.0,2920039,Albania,1.198614128098974
133,ALB,2013,3685983.0,2918978,Albania,1.2627649129250031
134,ALB,2014,3359654.0,2920775,Albania,1.150261146442297
135,ALB,2015,3400955.0,2923352,Albania,1.1633751255408176


**Now we slice it to see just the US and sort by time:**

In [0]:
more_phones_than_people[more_phones_than_people['country']=='United States'].sort_values(by='time').head()

Unnamed: 0,geo,time,cell_phones_total,population_total,country,cells_per_person
8131,USA,2014,355500000.0,317718779,United States,1.118914031833164
8132,USA,2015,382307000.0,319929162,United States,1.1949739048796058
8133,USA,2016,395881000.0,322179605,United States,1.228758722948959
8134,USA,2017,395881000.0,324459463,United States,1.2201246847283354


**Looks like the first year was 2014. We could also have checked where `cells_per_person` exceeds one. As a sanity check, we'll look at the original DataFrame from 2000 onward and see whether it agrees.**

In [0]:
us_after_2000 = (cell_user_data['country']=='United States') & (cell_user_data['time']>= 2000)
cell_user_data[us_after_2000].head(20)

Unnamed: 0,geo,time,cell_phones_total,population_total,country,cells_per_person
8117,USA,2000,109478031.0,281982778,United States,0.3882436784845066
8118,USA,2001,128500000.0,284852391,United States,0.4511108351553208
8119,USA,2002,141800000.0,287506847,United States,0.4932056452902494
8120,USA,2003,160637000.0,290027624,United States,0.5538679308699229
8121,USA,2004,184819000.0,292539324,United States,0.6317748925952943
8122,USA,2005,203700000.0,295129501,United States,0.6902054837276331
8123,USA,2006,229600000.0,297827356,United States,0.7709164231374367
8124,USA,2007,249300000.0,300595175,United States,0.8293546295279024
8125,USA,2008,261300000.0,303374067,United States,0.8613129084629373
8126,USA,2009,274283000.0,306076362,United States,0.8961260458264333


**Cool beans.**
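
The alternative mentioned above, checking where `cells_per_person` exceeds one, can be sketched for several countries at once with a `groupby`. The rows below are a small subset copied from this notebook's outputs, so the result only reflects the years included here.

```python
import pandas as pd

# A few rows copied from the outputs above (not the full dataset).
rows = pd.DataFrame({
    'country': ['United States', 'United States', 'United States',
                'Albania', 'Albania'],
    'time': [2009, 2014, 2015, 2011, 2012],
    'cell_phones_total': [274283000.0, 355500000.0, 382307000.0,
                          3100000.0, 3500000.0],
    'population_total': [306076362, 317718779, 319929162,
                         2926659, 2920039],
})
rows['cells_per_person'] = rows['cell_phones_total'] / rows['population_total']

# Keep the rows with more phones than people, then take the earliest year per country.
first_year = rows[rows['cells_per_person'] > 1].groupby('country')['time'].min()
print(first_year.to_dict())  # {'Albania': 2011, 'United States': 2014}
```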

## Part 4. Reshape data

Create a pivot table:
- Columns: Years 2007—2017
- Rows: China, India, United States, Indonesia, Brazil (order doesn't matter)
- Values: Cell Phones Total

The table's shape should be: (5, 11)

In [0]:
countries = ['China', 'India', 'United States', 'Indonesia', 'Brazil']
time_constraint = (cell_user_data['time'] >= 2007) & (cell_user_data['time'] <= 2017)
countries_in_time_range = (cell_user_data['country'].isin(countries))  & time_constraint
requested_pivot = pd.pivot_table(cell_user_data[countries_in_time_range], 
                                 values = 'cell_phones_total', 
                                 columns = 'time',
                                 index = ['country'])
requested_pivot.shape

(5, 11)

In [0]:
requested_pivot

time,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Brazil,120980103.0,150641403.0,169385584.0,196929978.0,234357507.0,248323703.0,271099799.0,280728796.0,257814274.0,244067356.0,236488548.0
China,547306000.0,641245000.0,747214000.0,859003000.0,986253000.0,1112155000.0,1229113000.0,1286093000.0,1291984200.0,1364934000.0,1474097000.0
India,233620000.0,346890000.0,525090000.0,752190000.0,893862478.0,864720917.0,886304245.0,944008677.0,1001056000.0,1127809000.0,1168902277.0
Indonesia,93386881.0,140578243.0,163676961.0,211290235.0,249805619.0,281963665.0,313226914.0,325582819.0,338948340.0,385573398.0,458923202.0
United States,249300000.0,261300000.0,274283000.0,285118000.0,297404000.0,304838000.0,310698000.0,355500000.0,382307000.0,395881000.0,395881000.0


**Nice.**
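
One detail worth knowing: `pd.pivot_table` aggregates with the mean by default. That is harmless here because each country/year pair has exactly one row, so the mean equals the value itself. A sketch on four values copied from the table above compares it with `DataFrame.pivot`, which only reshapes and raises an error on duplicate pairs.

```python
import pandas as pd

# Four values copied from the pivot table above, in long format.
long_rows = pd.DataFrame({
    'country': ['Brazil', 'Brazil', 'China', 'China'],
    'time': [2007, 2017, 2007, 2017],
    'cell_phones_total': [120980103.0, 236488548.0, 547306000.0, 1474097000.0],
})

# pivot: pure reshape, no aggregation.
wide = long_rows.pivot(index='country', columns='time', values='cell_phones_total')

# pivot_table: aggregates (mean by default), identical when pairs are unique.
wide_tbl = pd.pivot_table(long_rows, values='cell_phones_total',
                          index='country', columns='time')

print(wide.shape, (wide.values == wide_tbl.values).all())
```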

#### OPTIONAL BONUS QUESTION!

Sort these 5 countries by biggest increase in cell phones from 2007 to 2017.

Which country had 935,282,277 more cell phones in 2017 versus 2007?

**Not too bad. We'll just add a new feature, namely the increase in cell phones from 2007 to 2017, then filter the result to find the country where that number is 935,282,277.**

In [0]:
requested_pivot['increase'] = requested_pivot[2017]-requested_pivot[2007]
requested_pivot[requested_pivot['increase']==935282277]

time,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,increase
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
India,233620000.0,346890000.0,525090000.0,752190000.0,893862478.0,864720917.0,886304245.0,944008677.0,1001056000.0,1127809000.0,1168902277.0,935282277.0


**Looks like it was India.**
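
The sorting half of the bonus question can be sketched as follows, using the 2007 and 2017 columns copied from the pivot table above. The frame name is hypothetical.

```python
import pandas as pd

# 2007 and 2017 totals copied from the pivot table above.
pivot_2007_2017 = pd.DataFrame(
    {2007: [120980103.0, 547306000.0, 233620000.0, 93386881.0, 249300000.0],
     2017: [236488548.0, 1474097000.0, 1168902277.0, 458923202.0, 395881000.0]},
    index=['Brazil', 'China', 'India', 'Indonesia', 'United States'],
)

# Subtract the two year columns and sort from largest to smallest increase.
increase = (pivot_2007_2017[2017] - pivot_2007_2017[2007]).sort_values(ascending=False)
print(increase.index.tolist())
# ['India', 'China', 'Indonesia', 'United States', 'Brazil']
```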

If you have the time and curiosity, what other questions can you ask and answer with this data?

# Extra Stuff:

**I'm interested in slicing this data by region, so I'll append a new column from the `geo_country_codes` DataFrame:**

In [0]:
geo_country_codes.head(10)

Unnamed: 0,geo,alt_5,alternative_1,alternative_2,alternative_3,alternative_4_cdiac,arb1,arb2,arb3,arb4,...,latitude,longitude,main_religion_2008,country,pandg,un_state,unicode_region_subtag,upper_case_name,world_4region,world_6region
0,abkh,,,,,,,,,,...,,,,Abkhazia,,False,,,europe,europe_central_asia
1,abw,,,,,Aruba,,,,,...,12.5,-69.96667,christian,Aruba,,False,AW,ARUBA,americas,america
2,afg,,Islamic Republic of Afghanistan,,,Afghanistan,,,,,...,33.0,66.0,muslim,Afghanistan,AFGHANISTAN,True,AF,AFGHANISTAN,asia,south_asia
3,ago,,,,,Angola,,,,,...,-12.5,18.5,christian,Angola,ANGOLA,True,AO,ANGOLA,africa,sub_saharan_africa
4,aia,,,,,,,,,,...,18.21667,-63.05,christian,Anguilla,,False,AI,ANGUILLA,americas,america
5,akr_a_dhe,,,,,,,,,,...,,,,Akrotiri and Dhekelia,,False,,,europe,europe_central_asia
6,ala,,Åland,,,,,,,,...,60.25,20.0,,Åland,,False,AX,AALAND ISLANDS,europe,europe_central_asia
7,alb,,,,,Albania,,,,,...,41.0,20.0,muslim,Albania,ALBANIA,True,AL,ALBANIA,europe,europe_central_asia
8,and,,,,,,,,,,...,42.50779,1.52109,christian,Andorra,,True,AD,ANDORRA,europe,europe_central_asia
9,ant,,Neth. Antilles,,,Netherland Antilles,,,,,...,,,,Netherlands Antilles,,False,,NETHERLANDS ANTILLES,americas,america


In [0]:
cell_user_data = pd.merge(cell_user_data, geo_country_codes[['world_6region', 'country']])
cell_user_data = cell_user_data.rename(columns={'world_6region':'region'})

In [0]:
region_groups = cell_user_data.groupby('region')

In [0]:
region_groups.head()

Unnamed: 0,geo,time,cell_phones_total,population_total,country,cells_per_person,region
0,AFG,1960,0.0,8996351,Afghanistan,0.0,south_asia
1,AFG,1965,0.0,9938414,Afghanistan,0.0,south_asia
2,AFG,1970,0.0,11126123,Afghanistan,0.0,south_asia
3,AFG,1975,0.0,12590286,Afghanistan,0.0,south_asia
4,AFG,1976,0.0,12840299,Afghanistan,0.0,south_asia
46,AGO,1960,0.0,5643182,Angola,0.0,sub_saharan_africa
47,AGO,1965,0.0,6203299,Angola,0.0,sub_saharan_africa
48,AGO,1970,0.0,6776381,Angola,0.0,sub_saharan_africa
49,AGO,1975,0.0,7682479,Angola,0.0,sub_saharan_africa
50,AGO,1976,0.0,7900997,Angola,0.0,sub_saharan_africa
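
Calling `head()` on a grouped frame returns the first five rows of each group rather than a summary. To get one number per region, the groups need an aggregation. A minimal sketch using three 2011 rows copied from this notebook's outputs (so the totals cover only these countries):

```python
import pandas as pd

# 2011 rows copied from the outputs in this notebook; regions as listed in geo_country_codes.
rows_2011 = pd.DataFrame({
    'country': ['Brazil', 'Canada', 'Albania'],
    'region': ['america', 'america', 'europe_central_asia'],
    'cell_phones_total': [234357507.0, 26840000.0, 3100000.0],
})

# Sum within each region, then sort so the largest region comes first.
region_totals = (rows_2011.groupby('region')['cell_phones_total']
                 .sum()
                 .sort_values(ascending=False))
print(region_totals.index[0])  # america
```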


**Now we'll do a pivot table to see the largest cell phone counts by region:**

In [0]:
time_constraint = cell_user_data['time'] >= 2010
region_cells = pd.pivot_table(cell_user_data[time_constraint],
                              values='cell_phones_total', 
                              index = ['region', 'country'],
                              columns = ['time'])
region_cells

Unnamed: 0_level_0,time,2010,2011,2012,2013,2014,2015,2016,2017
region,country,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
america,Antigua and Barbuda,167970.0,176008.0,127381.0,114358.0,120041.0,176000.0,180000.0,
america,Argentina,57082298.0,60722729.0,64327647.0,67361515.0,61234216.0,61842011.0,63723692.0,61897379.0
america,Bahamas,428377.0,298790.0,300000.0,287000.0,314842.0,311175.0,360200.0,353540.0
america,Barbados,350061.0,347917.0,349296.0,307708.0,305456.0,334792.0,332208.0,337791.0
america,Belize,194201.0,222407.0,172423.0,174615.0,172300.0,211946.0,227000.0,
america,Bolivia,7179293.0,8353273.0,9493207.0,10425704.0,10450341.0,10162829.0,10106216.0,10963224.0
america,Brazil,196929978.0,234357507.0,248323703.0,271099799.0,280728796.0,257814274.0,244067356.0,236488548.0
america,Canada,25825400.0,26840000.0,27720000.0,28360000.0,28789000.0,29765000.0,30752000.0,31458600.0
america,Chile,19852242.0,22315248.0,23940973.0,23661339.0,23680718.0,23206353.0,23302603.0,23013147.0
america,Colombia,44477653.0,46200421.0,49066359.0,50295114.0,55330272.0,57327470.0,58684924.0,62222011.0


**Neat.**

## I tried `matplotlib` below, with predictable results.

**Now we'll plot all the population growth by country, colored by region:**

In [0]:
import matplotlib.pyplot as plt
unpacked.head()

Unnamed: 0_level_0,region,america,east_asia_pacific,europe_central_asia,middle_east_north_africa,south_asia,sub_saharan_africa
time,country,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2010,Afghanistan,,,,,10215840.0,
2010,Albania,,,2692372.0,,,
2010,Algeria,,,,32780165.0,,
2010,Andorra,,,65495.0,,,
2010,Angola,,,,,,9403365.0
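
The cell that defines `unpacked` does not appear in this notebook. Judging only from the output above (rows indexed by `time` and `country`, one column per region), one guess at a call that produces this layout is a pivot table like the one below. The values are copied from that output; treating them as `cell_phones_total` is an assumption, and `unpacked_guess` is a hypothetical name.

```python
import pandas as pd

# Three rows rebuilt from the output above; the value column name is assumed.
rows_2010 = pd.DataFrame({
    'time': [2010, 2010, 2010],
    'country': ['Afghanistan', 'Albania', 'Angola'],
    'region': ['south_asia', 'europe_central_asia', 'sub_saharan_africa'],
    'cell_phones_total': [10215840.0, 2692372.0, 9403365.0],
})

# Rows: (time, country). Columns: region. Cells outside a country's own region are NaN.
unpacked_guess = pd.pivot_table(rows_2010,
                                values='cell_phones_total',
                                index=['time', 'country'],
                                columns='region')
print(unpacked_guess.shape)  # (3, 3)
```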


In [0]:
def pop_plot(country, region):
    data = unpacked.loc[country, region]
    plt.plot(data.index, data.values)

In [0]:
for region in unpacked['region']:
  for country in unpacked['country']:
    pop_plot(country, region)

KeyError: ignored
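
The `KeyError` most likely comes from `unpacked['region']`: in the output shown earlier, `region` is the name of the column axis and `country` is an index level, so neither is a column that can be looked up by label. A sketch of one way to loop over them instead, on a small frame with the same layout (values copied from this notebook's outputs; `unpacked_like` is a hypothetical stand-in for `unpacked`):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

rows = pd.DataFrame({
    'time': [2010, 2011, 2011, 2012],
    'country': ['Brazil', 'Brazil', 'Albania', 'Albania'],
    'region': ['america', 'america', 'europe_central_asia', 'europe_central_asia'],
    'cell_phones_total': [196929978.0, 234357507.0, 3100000.0, 3500000.0],
})
unpacked_like = pd.pivot_table(rows, values='cell_phones_total',
                               index=['time', 'country'], columns='region')

fig, ax = plt.subplots()
# Regions are the column labels; countries live in the row index.
for i, region in enumerate(unpacked_like.columns):
    series = unpacked_like[region].dropna()  # keep only countries in this region
    for country, values in series.groupby(level='country'):
        years = values.index.get_level_values('time')
        # One color per region ('C0', 'C1', ...), one line per country.
        ax.plot(years, values.to_numpy(), color=f'C{i}', label=country)
ax.set_xlabel('time')
ax.set_ylabel('cell_phones_total')
```

With the full dataset this would draw one line per country, so a legend per country would be crowded; a legend keyed by region is probably the more readable choice.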
