
# Data Science Unit 1 Sprint Challenge 2

## Data Wrangling

In this Sprint Challenge you will use data from [Gapminder](https://www.gapminder.org/about-gapminder/), a Swedish non-profit co-founded by Hans Rosling. "Gapminder produces free teaching resources making the world understandable based on reliable statistics."
- [Cell phones (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv)
- [Population (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv)
- [Geo country codes](https://github.com/open-numbers/ddf--gapminder--systema_globalis/blob/master/ddf--entities--geo--country.csv)

These two links have everything you need to successfully complete the Sprint Challenge!
- [Pandas documentation: Working with Text Data](https://pandas.pydata.org/pandas-docs/stable/text.html) (one question)
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) (everything else)

## Part 0. Load data

You don't need to add or change anything here. Just run this cell to load the data into three dataframes.

In [0]:
import pandas as pd

cell_phones = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv')

population = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv')

geo_country_codes = (pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--entities--geo--country.csv')
                       .rename(columns={'country': 'geo', 'name': 'country'}))

## Part 1. Join data

First, join the `cell_phones` and `population` dataframes (with an inner join on `geo` and `time`).

The resulting dataframe's shape should be: (8590, 4)

In [3]:
# Quick glance at cell_phones dataframe
cell_phones.head(), cell_phones.shape

(   geo  time  cell_phones_total
 0  abw  1960                0.0
 1  abw  1965                0.0
 2  abw  1970                0.0
 3  abw  1975                0.0
 4  abw  1976                0.0, (9215, 3))

In [4]:
# Quick glance at population dataframe
population.head(), population.shape

(   geo  time  population_total
 0  afg  1800           3280000
 1  afg  1801           3280000
 2  afg  1802           3280000
 3  afg  1803           3280000
 4  afg  1804           3280000, (59297, 3))

In [0]:
# Merging cell_phones and population dataframes
df = pd.merge(cell_phones, population, how='inner', on=['geo', 'time'])

In [6]:
# Looking at the merged dataframe, which now includes population_total
df.head()

,geo,time,cell_phones_total,population_total
0,afg,1960,0.0,8996351
1,afg,1965,0.0,9938414
2,afg,1970,0.0,11126123
3,afg,1975,0.0,12590286
4,afg,1976,0.0,12840299


In [7]:
# Checking if the shape is correct --  it is
df.shape

(8590, 4)
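
The row count drops from 9215 to 8590 because an inner join keeps only the `(geo, time)` pairs found in both dataframes. A minimal sketch of that behavior follows, using small made-up frames (`phones` and `people` are hypothetical names and the population figures are placeholders, not Gapminder values). The `validate` argument is an optional safeguard that was not part of the original cell.

```python
import pandas as pd

# Hypothetical toy frames; the values are placeholders, not Gapminder data.
phones = pd.DataFrame({
    'geo': ['abw', 'abw', 'afg'],
    'time': [1960, 1965, 1960],
    'cell_phones_total': [0.0, 0.0, 0.0],
})
people = pd.DataFrame({
    'geo': ['afg', 'afg', 'abw'],
    'time': [1960, 1961, 1960],
    'population_total': [100, 110, 50],
})

# validate='one_to_one' raises a MergeError if a (geo, time) key
# is duplicated on either side of the join.
joined = pd.merge(phones, people, how='inner', on=['geo', 'time'],
                  validate='one_to_one')

# Only the (geo, time) pairs present in both frames survive the join.
print(joined.shape)
```

Here `(abw, 1965)` and `(afg, 1961)` each appear in only one frame, so they are dropped.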

Then, select the `geo` and `country` columns from the `geo_country_codes` dataframe, and join with your population and cell phone data.

The resulting dataframe's shape should be: (8590, 5)

In [0]:
# Merging geo_country_codes with the dataframe created above
# (inner join on the shared 'geo' column, stated explicitly)
df = pd.merge(geo_country_codes[['geo', 'country']], df, how='inner', on='geo')

In [10]:
# Seeing the new column
df.head()

,geo,country,time,cell_phones_total,population_total
0,afg,Afghanistan,1960,0.0,8996351
1,afg,Afghanistan,1965,0.0,9938414
2,afg,Afghanistan,1970,0.0,11126123
3,afg,Afghanistan,1975,0.0,12590286
4,afg,Afghanistan,1976,0.0,12840299


In [9]:
# Checking if the shape is correct --  it is
df.shape

(8590, 5)
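
The country lookup is a one-to-many join: each `geo` code appears once in the lookup table and many times in the data. The sketch below illustrates this with toy frames (`codes` and `data` are hypothetical names, and the `zzz` code is invented to show that unmatched lookup rows are dropped). The `validate='one_to_many'` check is an optional addition, not something the original cell used.

```python
import pandas as pd

# Toy lookup table; 'zzz' is an invented code with no matching data rows.
codes = pd.DataFrame({
    'geo': ['afg', 'usa', 'zzz'],
    'country': ['Afghanistan', 'United States', 'Unused'],
    'other_column': [1, 2, 3],
})
data = pd.DataFrame({
    'geo': ['afg', 'afg', 'usa'],
    'time': [1960, 1965, 2017],
    'cell_phones_total': [0.0, 0.0, 395881000.0],
})

# Select only the two needed lookup columns, then join on 'geo'.
# validate='one_to_many' raises a MergeError if a geo code repeats in `codes`.
labeled = pd.merge(codes[['geo', 'country']], data, how='inner', on='geo',
                   validate='one_to_many')

print(labeled)
```

Because the lookup frame is on the left, its columns (`geo`, `country`) come first in the result, matching the column order in the output above.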

## Part 2. Make features

Calculate the number of cell phones per person, and add this column onto your dataframe.

(You've calculated correctly if you get 1.220 cell phones per person in the United States in 2017.)

In [0]:
# Creating new column by dividing cell_phones total by population_total
df['cell_phone_per_person'] = df['cell_phones_total'] / df['population_total']

In [45]:
# Sorting by time and country to see if USA has the correct ratio for 2017 --  it does
df.sort_values(by=['time', 'country'], ascending=False).head(10)

,geo,country,time,cell_phones_total,population_total,cell_phone_per_person
8589,ZWE,Zimbabwe,2017,14092104.0,16529904,0.8525218295278666
8543,ZMB,Zambia,2017,13438539.0,17094130,0.786149338983616
8317,VNM,Vietnam,2017,120016181.0,95540800,1.2561772666755984
8271,VEN,Venezuela,2017,24493687.0,31977065,0.7659767086191306
8363,VUT,Vanuatu,2017,228016.0,276244,0.8254152126381026
8180,UZB,Uzbekistan,2017,24265460.0,31910641,0.7604190714940512
8091,URY,Uruguay,2017,5097569.0,3456750,1.474671006002748
8134,USA,United States,2017,395881000.0,324459463,1.2201246847283354
2817,GBR,United Kingdom,2017,79173658.0,66181585,1.196309486997025
219,ARE,United Arab Emirates,2017,19826224.0,9400145,2.109140231347495
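
Sorting and scanning the output works, but the check can also be done by selecting the single row of interest. The sketch below assumes a small frame named `sample` (a hypothetical name) built from two rows of the output above. In the notebook the same boolean mask would be applied to `df`.

```python
import pandas as pd

# Two rows copied from the output above; the real check would use the full df.
sample = pd.DataFrame({
    'country': ['United States', 'United Kingdom'],
    'time': [2017, 2017],
    'cell_phones_total': [395881000.0, 79173658.0],
    'population_total': [324459463, 66181585],
})
sample['cell_phone_per_person'] = (
    sample['cell_phones_total'] / sample['population_total']
)

# Boolean mask that picks out the United States in 2017.
is_usa_2017 = (sample['country'] == 'United States') & (sample['time'] == 2017)
usa_2017_ratio = sample.loc[is_usa_2017, 'cell_phone_per_person'].iloc[0]

print(round(usa_2017_ratio, 3))
```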


Modify the `geo` column to make the geo codes uppercase instead of lowercase.

In [0]:
# Changing geo column to all caps
df['geo'] = df['geo'].str.upper()

In [23]:
# Seeing that the change was made
df.head()

,geo,country,time,cell_phones_total,population_total,cell_phone_per_person
0,AFG,Afghanistan,1960,0.0,8996351,0.0
1,AFG,Afghanistan,1965,0.0,9938414,0.0
2,AFG,Afghanistan,1970,0.0,11126123,0.0
3,AFG,Afghanistan,1975,0.0,12590286,0.0
4,AFG,Afghanistan,1976,0.0,12840299,0.0


## Part 3. Process data

Use the `describe` function to summarize your dataframe's numeric columns, and then its non-numeric columns.

(You'll see the time period ranges from 1960 to 2017, and there are 195 unique countries represented.)

In [24]:
# Describing only numeric columns -- time period range is correct
df.describe()

,time,cell_phones_total,population_total,cell_phone_per_person
count,8590.0,8590.0,8590.0,8590.0
mean,1994.193481,9004950.0,29838230.0,0.279639
std,14.257975,55734080.0,116128400.0,0.454247
min,1960.0,0.0,4433.0,0.0
25%,1983.0,0.0,1456148.0,0.0
50%,1995.0,6200.0,5725062.0,0.001564
75%,2006.0,1697652.0,18105810.0,0.461149
max,2017.0,1474097000.0,1409517000.0,2.490243


In [25]:
# Describing only non-numeric columns -- unique country count is correct
import numpy as np
df.describe(exclude=np.number)

,geo,country
count,8590,8590
unique,195,195
top,TGO,El Salvador
freq,46,46


In 2017, what were the top 5 countries with the most cell phones total?

Your list of countries should have these totals:

| country | cell phones total |
|:-------:|:-----------------:|
|    ?    |     1,474,097,000 |
|    ?    |     1,168,902,277 |
|    ?    |       458,923,202 |
|    ?    |       395,881,000 |
|    ?    |       236,488,548 |



In [0]:
# This optional code formats float numbers with comma separators
pd.options.display.float_format = '{:,}'.format

In [0]:
# Creating a condition where time must equal 2017
condition = (df['time'] == 2017)

# Making subset dataframe with the condition
subset = df[condition]

In [0]:
# Making top_five dataframe by sorting the previous subset by cell_phones_total
top_five = subset.sort_values(by='cell_phones_total', ascending=False).head()

In [60]:
# Seeing if top_five dataframe matches the example above -- it does
top_five[['country', 'cell_phones_total']].head()

,country,cell_phones_total
1496,China,1474097000.0
3595,India,1168902277.0
3549,Indonesia,458923202.0
8134,United States,395881000.0
1084,Brazil,236488548.0
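
As an alternative to `sort_values(...).head()`, pandas offers `DataFrame.nlargest`, which returns the top rows directly. The sketch below uses a hypothetical frame named `phones_2017` holding seven 2017 rows copied from the outputs above. It is one possible shortcut, not the approach the notebook took.

```python
import pandas as pd

# Seven 2017 rows copied from the outputs above (a subset of the real data).
phones_2017 = pd.DataFrame({
    'country': ['Vietnam', 'United States', 'China', 'United Kingdom',
                'Brazil', 'India', 'Indonesia'],
    'cell_phones_total': [120016181.0, 395881000.0, 1474097000.0, 79173658.0,
                          236488548.0, 1168902277.0, 458923202.0],
})

# nlargest returns the n rows with the largest values, in descending order.
top_five_alt = phones_2017.nlargest(5, 'cell_phones_total')

print(top_five_alt)
```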


2017 was the first year that China had more cell phones than people.

What was the first year that the USA had more cell phones than people?

In [0]:
# Creating condition where country must equal 'United States'
condition2 = df['country'] == 'United States'

# Creating a subset dataframe based on the above condition
subset2 = df[condition2]

In [44]:
subset2.sort_values(by='cell_phone_per_person', ascending=False).head(10)

# The first year the USA had more cell phones than people was 2014
# as that is the first time the cell_phone_per_person ratio goes above 1

,geo,country,time,cell_phones_total,population_total,cell_phone_per_person
8133,USA,United States,2016,395881000.0,322179605,1.228758722948959
8134,USA,United States,2017,395881000.0,324459463,1.2201246847283354
8132,USA,United States,2015,382307000.0,319929162,1.1949739048796058
8131,USA,United States,2014,355500000.0,317718779,1.118914031833164
8130,USA,United States,2013,310698000.0,315536676,0.9846652501340288
8129,USA,United States,2012,304838000.0,313335423,0.9728807457559626
8128,USA,United States,2011,297404000.0,311051373,0.9561250192584748
8127,USA,United States,2010,285118000.0,308641391,0.9237840688710478
8126,USA,United States,2009,274283000.0,306076362,0.8961260458264333
8125,USA,United States,2008,261300000.0,303374067,0.8613129084629373
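
The answer above is read off the sorted table by eye. A more direct route is to keep only the years where the ratio exceeds 1 and take the earliest. The sketch below assumes a hypothetical frame named `usa` that holds the 2012 to 2017 ratios from the output above, rounded to four decimals.

```python
import pandas as pd

# Ratios for 2012 to 2017, rounded from the output above.
usa = pd.DataFrame({
    'time': [2012, 2013, 2014, 2015, 2016, 2017],
    'cell_phone_per_person': [0.9729, 0.9847, 1.1189, 1.1950, 1.2288, 1.2201],
})

# Keep the years with more phones than people, then take the earliest one.
more_phones_than_people = usa['cell_phone_per_person'] > 1
first_year = usa.loc[more_phones_than_people, 'time'].min()

print(first_year)
```

This avoids relying on sort order, which can mislead when the ratio dips after peaking (as it does here between 2016 and 2017).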


## Part 4. Reshape data

Create a pivot table:
- Columns: Years 2007 to 2017
- Rows: China, India, United States, Indonesia, Brazil (order doesn't matter)
- Values: Cell Phones Total

The table's shape should be: (5, 11)

In [0]:
# Creating a condition where time must be equal to or greater than 2007
# and country must be one of the 5 countries listed above
top_countries = ['China', 'India', 'United States', 'Indonesia', 'Brazil']
condition3 = (df['time'] >= 2007) & df['country'].isin(top_countries)

In [0]:
# Creating subset dataframe with the above condition
subset3 = df[condition3]

In [53]:
# Making sure there are no unwanted countries in dataframe
subset3['country'].value_counts()

Indonesia        11
China            11
United States    11
Brazil           11
India            11
Name: country, dtype: int64

In [54]:
# Making sure there are no unwanted years in dataframe
subset3['time'].value_counts()

2017    5
2016    5
2015    5
2014    5
2013    5
2012    5
2011    5
2010    5
2009    5
2008    5
2007    5
Name: time, dtype: int64

In [56]:
# Creating pivot table with the subset
pivot = subset3.pivot(index='country', columns='time', values='cell_phones_total')
pivot

country,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
Brazil,120980103.0,150641403.0,169385584.0,196929978.0,234357507.0,248323703.0,271099799.0,280728796.0,257814274.0,244067356.0,236488548.0
China,547306000.0,641245000.0,747214000.0,859003000.0,986253000.0,1112155000.0,1229113000.0,1286093000.0,1291984200.0,1364934000.0,1474097000.0
India,233620000.0,346890000.0,525090000.0,752190000.0,893862478.0,864720917.0,886304245.0,944008677.0,1001056000.0,1127809000.0,1168902277.0
Indonesia,93386881.0,140578243.0,163676961.0,211290235.0,249805619.0,281963665.0,313226914.0,325582819.0,338948340.0,385573398.0,458923202.0
United States,249300000.0,261300000.0,274283000.0,285118000.0,297404000.0,304838000.0,310698000.0,355500000.0,382307000.0,395881000.0,395881000.0


In [57]:
# Checking to make sure the pivot table is the right shape -- it is
pivot.shape

(5, 11)
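
The filter-then-pivot pattern can be shown end to end on a handful of rows. The sketch below assumes a hypothetical long-format frame named `long_df`, built from values that appear in the outputs above, and uses `isin` and `between` for the filter. It is a simplified illustration with two countries, not a replacement for the cells above.

```python
import pandas as pd

# Long-format rows copied from outputs above; Afghanistan and Vietnam
# are included only so the filter has something to exclude.
long_df = pd.DataFrame({
    'country': ['China', 'China', 'India', 'India', 'Vietnam', 'Afghanistan'],
    'time': [2007, 2017, 2007, 2017, 2017, 1960],
    'cell_phones_total': [547306000.0, 1474097000.0, 233620000.0,
                          1168902277.0, 120016181.0, 0.0],
})

# Keep the wanted countries and years (between is inclusive on both ends).
wanted = ['China', 'India']
mask = long_df['country'].isin(wanted) & long_df['time'].between(2007, 2017)

# One row per country, one column per year.
wide = long_df[mask].pivot(index='country', columns='time',
                           values='cell_phones_total')

print(wide)
```

Note that `pivot` raises an error if any `(country, time)` pair is duplicated. `pivot_table` would aggregate duplicates instead.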

#### OPTIONAL BONUS QUESTION!

Sort these 5 countries, by biggest increase in cell phones from 2007 to 2017.

Which country had 935,282,277 more cell phones in 2017 versus 2007?

In [0]:
# Creating a column that shows the total increase in cell phones from 2007 to 2017
pivot['cell_phone_incr_2007_to_2017'] = pivot[2017] - pivot[2007]

In [69]:
# Seeing that the new column is on the dataframe
pivot.head()

country,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,cell_phone_incr_2007_to_2017
Brazil,120980103.0,150641403.0,169385584.0,196929978.0,234357507.0,248323703.0,271099799.0,280728796.0,257814274.0,244067356.0,236488548.0,115508445.0
China,547306000.0,641245000.0,747214000.0,859003000.0,986253000.0,1112155000.0,1229113000.0,1286093000.0,1291984200.0,1364934000.0,1474097000.0,926791000.0
India,233620000.0,346890000.0,525090000.0,752190000.0,893862478.0,864720917.0,886304245.0,944008677.0,1001056000.0,1127809000.0,1168902277.0,935282277.0
Indonesia,93386881.0,140578243.0,163676961.0,211290235.0,249805619.0,281963665.0,313226914.0,325582819.0,338948340.0,385573398.0,458923202.0,365536321.0
United States,249300000.0,261300000.0,274283000.0,285118000.0,297404000.0,304838000.0,310698000.0,355500000.0,382307000.0,395881000.0,395881000.0,146581000.0


In [72]:
# Sorting the countries by their 2007 to 2017 increase, largest first
pivot.sort_values(by='cell_phone_incr_2007_to_2017', ascending=False)

country,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,cell_phone_incr_2007_to_2017
India,233620000.0,346890000.0,525090000.0,752190000.0,893862478.0,864720917.0,886304245.0,944008677.0,1001056000.0,1127809000.0,1168902277.0,935282277.0
China,547306000.0,641245000.0,747214000.0,859003000.0,986253000.0,1112155000.0,1229113000.0,1286093000.0,1291984200.0,1364934000.0,1474097000.0,926791000.0
Indonesia,93386881.0,140578243.0,163676961.0,211290235.0,249805619.0,281963665.0,313226914.0,325582819.0,338948340.0,385573398.0,458923202.0,365536321.0
United States,249300000.0,261300000.0,274283000.0,285118000.0,297404000.0,304838000.0,310698000.0,355500000.0,382307000.0,395881000.0,395881000.0,146581000.0
Brazil,120980103.0,150641403.0,169385584.0,196929978.0,234357507.0,248323703.0,271099799.0,280728796.0,257814274.0,244067356.0,236488548.0,115508445.0


In [0]:
# The country with the largest increase in total cell phones (935,282,277) from 2007 to 2017 is India
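
The same answer can be obtained without adding a column to the pivot table, by computing the difference as a separate Series. The sketch below assumes a hypothetical frame named `wide` that holds only the 2007 and 2017 columns, copied from the pivot output above.

```python
import pandas as pd

# The 2007 and 2017 columns copied from the pivot table above.
wide = pd.DataFrame(
    {
        2007: [120980103.0, 547306000.0, 233620000.0, 93386881.0, 249300000.0],
        2017: [236488548.0, 1474097000.0, 1168902277.0, 458923202.0, 395881000.0],
    },
    index=pd.Index(['Brazil', 'China', 'India', 'Indonesia', 'United States'],
                   name='country'),
)

# Subtracting columns gives a Series indexed by country.
increase = (wide[2017] - wide[2007]).sort_values(ascending=False)

# idxmax returns the index label (the country) with the largest increase.
biggest = increase.idxmax()

print(increase)
print(biggest)
```

Keeping the difference in its own Series leaves the pivot table with year columns only, so later year-based operations do not need to skip the extra column.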

If you have the time and curiosity, what other questions can you ask and answer with this data?

In [0]:
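# Relative change from 2007 to 2017, as a fraction of the 2007 total
# (multiply by 100 for a percentage)
pivot['cell_phone_pct_chg_2007_2017'] = pivot['cell_phone_incr_2007_to_2017'] / pivot[2007]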

In [74]:
pivot

country,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,cell_phone_incr_2007_to_2017,cell_phone_pct_chg_2007_2017
Brazil,120980103.0,150641403.0,169385584.0,196929978.0,234357507.0,248323703.0,271099799.0,280728796.0,257814274.0,244067356.0,236488548.0,115508445.0,0.954772248788712
China,547306000.0,641245000.0,747214000.0,859003000.0,986253000.0,1112155000.0,1229113000.0,1286093000.0,1291984200.0,1364934000.0,1474097000.0,926791000.0,1.693368974577293
India,233620000.0,346890000.0,525090000.0,752190000.0,893862478.0,864720917.0,886304245.0,944008677.0,1001056000.0,1127809000.0,1168902277.0,935282277.0,4.003434110949405
Indonesia,93386881.0,140578243.0,163676961.0,211290235.0,249805619.0,281963665.0,313226914.0,325582819.0,338948340.0,385573398.0,458923202.0,365536321.0,3.9142148991998136
United States,249300000.0,261300000.0,274283000.0,285118000.0,297404000.0,304838000.0,310698000.0,355500000.0,382307000.0,395881000.0,395881000.0,146581000.0,0.5879703168872844


In [0]:
# India had the biggest percent change in total cell phones from 2007 to 2017 as well, with a 400% increase
# The US had the smallest percent change at about 59%