<a href="https://colab.research.google.com/github/biovir3/DS-Unit-1-Sprint-2-Data-Wrangling/blob/master/DS_Unit_1_Sprint_Challenge_2_Data_Wrangling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science Unit 1 Sprint Challenge 2

## Data Wrangling

In this Sprint Challenge you will use data from [Gapminder](https://www.gapminder.org/about-gapminder/), a Swedish non-profit co-founded by Hans Rosling. "Gapminder produces free teaching resources making the world understandable based on reliable statistics."
- [Cell phones (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv)
- [Population (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv)
- [Geo country codes](https://github.com/open-numbers/ddf--gapminder--systema_globalis/blob/master/ddf--entities--geo--country.csv)

These two links have everything you need to successfully complete the Sprint Challenge!
- [Pandas documentation: Working with Text Data](https://pandas.pydata.org/pandas-docs/stable/text.html) (one question)
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) (everything else)

## Part 0. Load data

You don't need to add or change anything here. Just run this cell and it loads the data for you, into three dataframes.

In [0]:
import pandas as pd

cell_phones = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv')

population = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv')

geo_country_codes = (pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--entities--geo--country.csv')
                       .rename(columns={'country': 'geo', 'name': 'country'}))

## Part 1. Join data

First, join the `cell_phones` and `population` dataframes (with an inner join on `geo` and `time`).

The resulting dataframe's shape should be: (8590, 4)

In [0]:
print('Inner join of cell_phones and population on geo and time:')
cell_phone_population = pd.merge(cell_phones, population, how='inner', on=['geo', 'time'])
cell_phone_population.head()


Inner join of cell_phones and population on geo and time:


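|   | geo | time | cell_phones_total | population_total |
|---|-----|------|-------------------|------------------|
| 0 | afg | 1960 | 0.0 | 8996351 |
| 1 | afg | 1965 | 0.0 | 9938414 |
| 2 | afg | 1970 | 0.0 | 11126123 |
| 3 | afg | 1975 | 0.0 | 12590286 |
| 4 | afg | 1976 | 0.0 | 12840299 |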


In [0]:
print('The Shape of the previous example is shown Below')
cell_phone_population.shape

The Shape of the previous example is shown Below


(8590, 4)
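
The inner join above keeps only the (`geo`, `time`) pairs that appear in both dataframes. A minimal sketch on hypothetical toy frames (the names `phones`, `people`, and the codes `aaa`, `bbb`, `ccc` are made up for illustration) shows rows without a partner being dropped:

```python
import pandas as pd

# Hypothetical toy frames that mimic the column layout of cell_phones and population
phones = pd.DataFrame({
    'geo': ['aaa', 'aaa', 'bbb'],
    'time': [2016, 2017, 2017],
    'cell_phones_total': [10.0, 12.0, 7.0],
})
people = pd.DataFrame({
    'geo': ['aaa', 'bbb', 'ccc'],
    'time': [2017, 2017, 2017],
    'population_total': [20, 14, 30],
})

# how='inner' is the default for pd.merge; it is spelled out here for clarity
joined = pd.merge(phones, people, how='inner', on=['geo', 'time'])

# Only (aaa, 2017) and (bbb, 2017) exist in both frames
print(joined.shape)
```

The row for (`aaa`, 2016) has no population and the row for (`ccc`, 2017) has no phone count, so neither survives the join.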

Then, select the `geo` and `country` columns from the `geo_country_codes` dataframe, and join with your population and cell phone data.

The resulting dataframe's shape should be: (8590, 5)

In [0]:
geo_codes_cpp = pd.merge(cell_phone_population,geo_country_codes[['geo', 'country']], on='geo')

In [0]:
print('The Merge took place in the previous code section, and the shape is below')
geo_codes_cpp.shape

The Merge took place in the previous code section, and the shape is below


(8590, 5)
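
One way to check that a lookup-table join like this neither duplicates nor silently loses rows is to use the `validate` and `indicator` parameters of `pd.merge`. The sketch below uses hypothetical toy data (the frames `facts` and `codes` and the code `zzz` are assumptions for illustration) and a left join so that unmatched codes stay visible:

```python
import pandas as pd

# Hypothetical fact table with one geo code that is missing from the lookup table
facts = pd.DataFrame({
    'geo': ['aaa', 'aaa', 'bbb', 'zzz'],
    'time': [2016, 2017, 2017, 2017],
    'cell_phones_total': [10.0, 12.0, 7.0, 3.0],
})
codes = pd.DataFrame({
    'geo': ['aaa', 'bbb'],
    'country': ['Aland', 'Bland'],
    'extra': ['x', 'y'],
})

# validate='many_to_one' raises MergeError if a geo code repeats in the lookup table;
# indicator=True adds a _merge column recording where each row came from
labelled = pd.merge(facts, codes[['geo', 'country']], on='geo', how='left',
                    validate='many_to_one', indicator=True)

# Rows whose geo code had no match in the lookup table
unmatched = labelled[labelled['_merge'] == 'left_only']
print(unmatched['geo'].tolist())
```

If `unmatched` is empty on the real data, the inner join and the left join return the same rows.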

## Part 2. Make features

Calculate the number of cell phones per person, and add this column onto your dataframe.

(You've calculated correctly if you get 1.220 cell phones per person in the United States in 2017.)

In [0]:
temp = pd.DataFrame()

In [0]:
#The line of code below calculates the # of cell phones per person and adds it to the DataFrame
geo_codes_cpp['cell_per_person'] = geo_codes_cpp['cell_phones_total'] / geo_codes_cpp['population_total']

In [0]:
print('After adding the number of cell phones per person, I have extracted the USA Row for 2017')
print('The result is below, and it shows the expected value.')
# Filter by country name so the lookup works whether or not geo has been uppercased yet
geo_codes_cpp[(geo_codes_cpp['time'] == 2017) & (geo_codes_cpp['country'] == 'United States')]

After adding the number of cell phones per person, I have extracted the USA Row for 2017
The result is below, and it shows the expected value.


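|      | geo | time | cell_phones_total | population_total | country | cell_per_person |
|------|-----|------|-------------------|------------------|---------|-----------------|
| 8134 | USA | 2017 | 395881000.0 | 324459463 | United States | 1.2201246847283354 |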
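
The feature is a plain element-wise division of two columns. As a small worked check, the sketch below rebuilds only the United States 2017 row from the values shown above (the frame name `usa_2017` is made up for illustration):

```python
import pandas as pd

# The United States 2017 row, with values copied from the output above
usa_2017 = pd.DataFrame({
    'geo': ['usa'],
    'time': [2017],
    'cell_phones_total': [395881000.0],
    'population_total': [324459463],
})

# Element-wise division: one ratio per row
usa_2017['cell_per_person'] = usa_2017['cell_phones_total'] / usa_2017['population_total']
print(round(usa_2017.loc[0, 'cell_per_person'], 3))
```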


Modify the `geo` column to make the geo codes uppercase instead of lowercase.

In [0]:
#The line of code below takes every 3 letter country code and changes them to upper case
geo_codes_cpp['geo'] = geo_codes_cpp['geo'].str.upper()

In [0]:
print('The selection below shows that the geo codes in the geo column have been uppercased')
geo_codes_cpp['geo'].head(5)

0    AFG
1    AFG
2    AFG
3    AFG
4    AFG
Name: geo, dtype: object
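
The `.str` accessor applies a string method to every element of a column and passes missing values through. A minimal sketch with a hypothetical Series (the name `codes` and the missing entry are assumptions for illustration):

```python
import pandas as pd

# Hypothetical geo codes, including one missing value
codes = pd.Series(['afg', 'usa', None], name='geo')

# .str.upper() uppercases each string and leaves the missing entry missing
upper = codes.str.upper()
print(upper.tolist())
```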

## Part 3. Process data

Use the describe function to describe your dataframe's numeric columns, and then its non-numeric columns.

(You'll see the time period ranges from 1960 to 2017, and there are 195 unique countries represented.)

In [0]:
print(geo_codes_cpp.describe())
#By Default the describe method only displays numeric data.
print('The default for Describe, is to only show numeric columns as shown above')

                     time    cell_phones_total     population_total  \
count             8,590.0              8,590.0              8,590.0   
mean  1,994.1934807916182  9,004,949.642905472 29,838,230.581722934   
std    14.257974607310302 55,734,084.872179635 116,128,377.47477299   
min               1,960.0                  0.0              4,433.0   
25%               1,983.0                  0.0          1,456,148.0   
50%               1,995.0              6,200.0          5,725,062.5   
75%               2,006.0          1,697,652.0         18,105,812.0   
max               2,017.0      1,474,097,000.0      1,409,517,397.0   

            cell_per_person  
count               8,590.0  
mean     0.2796385558059151  
std       0.454246656214052  
min                     0.0  
25%                     0.0  
50%   0.0015636266438163813  
75%      0.4611491855201403  
max       2.490242818521353  
The default for Describe, is to only show numeric columns as shown above


In [0]:
import numpy as np
print('Below I am describing all of the non numeric values from the geo codes DataFrame')
geo_codes_cpp.describe(exclude=np.number)

|        | geo | country |
|--------|-----|---------|
| count  | 8590 | 8590 |
| unique | 195 | 195 |
| top    | KAZ | Poland |
| freq   | 46 | 46 |
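
The two `describe` calls above return different summary rows depending on the column types. A small sketch on a hypothetical toy frame (the name `toy` and its values are made up for illustration) makes the difference easy to see:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one text column and two numeric columns
toy = pd.DataFrame({
    'geo': ['AAA', 'AAA', 'BBB'],
    'time': [2016, 2017, 2017],
    'cell_phones_total': [10.0, 12.0, 7.0],
})

# Default: numeric columns only (count, mean, std, min, quartiles, max)
numeric_summary = toy.describe()

# exclude=np.number: everything that is not numeric (count, unique, top, freq)
text_summary = toy.describe(exclude=np.number)

print(numeric_summary.columns.tolist())
print(text_summary.index.tolist())
```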


In 2017, what were the top 5 countries with the most cell phones total?

Your list of countries should have these totals:

| country | cell phones total |
|:-------:|:-----------------:|
|    ?    |     1,474,097,000 |
|    ?    |     1,168,902,277 |
|    ?    |       458,923,202 |
|    ?    |       395,881,000 |
|    ?    |       236,488,548 |



In [0]:
# This optional code formats float numbers with comma separators
pd.options.display.float_format = '{:,}'.format

In [0]:
# Keep only the columns needed to rank countries by total cell phones
geo_sort_by_cell_phone_total_2017 = geo_codes_cpp[['country', 'cell_phones_total', 'time']]


In [0]:
print('Below I am filtering to 2017, sorting by the cell_phones_total column, and selecting the 5 largest values')
geo_sort_by_cell_phone_total_2017.loc[geo_sort_by_cell_phone_total_2017['time'] == 2017].sort_values('cell_phones_total', ascending=False).head(5)

Below I am filtering to 2017, sorting by the cell_phones_total column, and selecting the 5 largest values


|      | country | cell_phones_total | time |
|------|---------|-------------------|------|
| 1496 | China | 1474097000.0 | 2017 |
| 3595 | India | 1168902277.0 | 2017 |
| 3549 | Indonesia | 458923202.0 | 2017 |
| 8134 | United States | 395881000.0 | 2017 |
| 1084 | Brazil | 236488548.0 | 2017 |
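
An alternative to sorting and taking `head(5)` is `DataFrame.nlargest`. The sketch below is one possible variant, not the approach used above; it reuses totals that appear in this notebook's output, plus China's 2016 value, to show why the year filter has to come first (the frame name `totals` is made up for illustration):

```python
import pandas as pd

# Totals copied from this notebook's output; China 2016 is included on purpose
totals = pd.DataFrame({
    'country': ['China', 'China', 'India', 'Indonesia', 'United States', 'Brazil'],
    'time': [2016, 2017, 2017, 2017, 2017, 2017],
    'cell_phones_total': [1364934000.0, 1474097000.0, 1168902277.0,
                          458923202.0, 395881000.0, 236488548.0],
})

# Filter to 2017 first, then take the five largest totals
top5_2017 = totals[totals['time'] == 2017].nlargest(5, 'cell_phones_total')
print(top5_2017['country'].tolist())
```

Without the filter, China's 2016 row would outrank India's 2017 row and push Brazil out of the top five.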


2017 was the first year that China had more cell phones than people.

What was the first year that the USA had more cell phones than people?

In [0]:
print('Here I have taken the full dataframe, selected the United States, and then picked the rows')
print('where the number of cell phones is greater than the population')
usa = geo_codes_cpp[geo_codes_cpp['country'] == 'United States']
usa[usa['cell_phones_total'] > usa['population_total']]
# 2014 was the first year that the USA had more cell phones than people.

Here I have taken the full dataframe, selected the United States, and then picked the rows
where the number of cell phones is greater than the population


|      | geo | time | cell_phones_total | population_total | country | cell_per_person |
|------|-----|------|-------------------|------------------|---------|-----------------|
| 8131 | USA | 2014 | 355500000.0 | 317718779 | United States | 1.118914031833164 |
| 8132 | USA | 2015 | 382307000.0 | 319929162 | United States | 1.1949739048796058 |
| 8133 | USA | 2016 | 395881000.0 | 322179605 | United States | 1.228758722948959 |
| 8134 | USA | 2017 | 395881000.0 | 324459463 | United States | 1.2201246847283354 |
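
Instead of reading the first year off the filtered table, the answer can be computed by taking the minimum of `time` over the matching rows. The sketch below applies the same idea to Brazil, using the 2007 to 2011 rows that appear in the Part 4 output of this notebook (the frame name `brazil` is made up for illustration):

```python
import pandas as pd

# Brazil rows for 2007 to 2011, copied from the Part 4 output of this notebook
brazil = pd.DataFrame({
    'time': [2007, 2008, 2009, 2010, 2011],
    'cell_phones_total': [120980103.0, 150641403.0, 169385584.0, 196929978.0, 234357507.0],
    'population_total': [191026637, 192979029, 194895996, 196796269, 198686688],
})

# Keep the years with more phones than people, then take the earliest one
more_phones = brazil[brazil['cell_phones_total'] > brazil['population_total']]
first_year = more_phones['time'].min()
print(first_year)
```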


## Part 4. Reshape data

Create a pivot table:
- Columns: Years 2007—2017
- Rows: China, India, United States, Indonesia, Brazil (order doesn't matter)
- Values: Cell Phones Total

The table's shape should be: (5, 11)

In [0]:
#China = CHN
#United States = USA
#India = IND
#Indonesia = IDN
#Brazil = BRA

In [0]:
# Keep only 2007 through 2017 for the five countries listed above
top5_geo = ['CHN', 'USA', 'IND', 'IDN', 'BRA']
geo_codes_cpp_sel_years = geo_codes_cpp[geo_codes_cpp['time'].between(2007, 2017) & geo_codes_cpp['geo'].isin(top5_geo)]

In [0]:
print('In the code section above, I pulled out all of the data that matched the required metrics and copied them into a new dataframe in order to only have the data required for the pivot table')
geo_codes_cpp_sel_years.head()

In the code section above, I pulled out all of the data that matched the required metrics and copied them into a new dataframe in order to only have the data required for the pivot table


|      | geo | time | cell_phones_total | population_total | country | cell_per_person |
|------|-----|------|-------------------|------------------|---------|-----------------|
| 1074 | BRA | 2007 | 120980103.0 | 191026637 | Brazil | 0.6333153580042348 |
| 1075 | BRA | 2008 | 150641403.0 | 192979029 | Brazil | 0.7806102237150339 |
| 1076 | BRA | 2009 | 169385584.0 | 194895996 | Brazil | 0.869107562373934 |
| 1077 | BRA | 2010 | 196929978.0 | 196796269 | Brazil | 1.000679428531239 |
| 1078 | BRA | 2011 | 234357507.0 | 198686688 | Brazil | 1.1795330092774006 |


In [0]:
print('Below, I have setup the pivot table as required.')
pd.pivot_table(geo_codes_cpp_sel_years, index=['geo'], values = ['cell_phones_total'], columns = ['time'])

Below, I have setup the pivot table as required.


`cell_phones_total` by `geo` (rows) and `time` (columns):

| geo | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 |
|-----|------|------|------|------|------|------|------|------|------|------|------|
| BRA | 120980103.0 | 150641403.0 | 169385584.0 | 196929978.0 | 234357507.0 | 248323703.0 | 271099799.0 | 280728796.0 | 257814274.0 | 244067356.0 | 236488548.0 |
| CHN | 547306000.0 | 641245000.0 | 747214000.0 | 859003000.0 | 986253000.0 | 1112155000.0 | 1229113000.0 | 1286093000.0 | 1291984200.0 | 1364934000.0 | 1474097000.0 |
| IDN | 93386881.0 | 140578243.0 | 163676961.0 | 211290235.0 | 249805619.0 | 281963665.0 | 313226914.0 | 325582819.0 | 338948340.0 | 385573398.0 | 458923202.0 |
| IND | 233620000.0 | 346890000.0 | 525090000.0 | 752190000.0 | 893862478.0 | 864720917.0 | 886304245.0 | 944008677.0 | 1001056000.0 | 1127809000.0 | 1168902277.0 |
| USA | 249300000.0 | 261300000.0 | 274283000.0 | 285118000.0 | 297404000.0 | 304838000.0 | 310698000.0 | 355500000.0 | 382307000.0 | 395881000.0 | 395881000.0 |
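
Passing `values` as a list, as in the cell above, produces two-level column labels with `cell_phones_total` on top. Passing it as a plain string gives single-level year columns instead. A minimal sketch of that variant, using a few 2016 and 2017 values from this notebook (the frame names `long_form` and `wide` are made up for illustration):

```python
import pandas as pd

# Long-form rows: one row per country and year, values copied from this notebook
long_form = pd.DataFrame({
    'country': ['Brazil', 'Brazil', 'United States', 'United States'],
    'time': [2016, 2017, 2016, 2017],
    'cell_phones_total': [244067356.0, 236488548.0, 395881000.0, 395881000.0],
})

# A string for `values` (not a list) keeps the columns as a flat index of years.
# The default aggfunc is the mean, which leaves values unchanged when each
# (country, time) pair occurs exactly once.
wide = pd.pivot_table(long_form, index='country', columns='time',
                      values='cell_phones_total')
print(wide.shape)
```

With flat columns, a single year can be selected directly with `wide[2017]`.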


#### OPTIONAL BONUS QUESTION!

Sort these 5 countries, by biggest increase in cell phones from 2007 to 2017.

Which country had 935,282,277 more cell phones in 2017 versus 2007?

In [0]:
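# New column: increase from 2007 to 2017 = 2017 value - 2007 value
# Index each year by geo first; subtracting the filtered Series directly aligns on row labels that never match across years and yields NaN
cells_2017 = geo_codes_cpp_sel_years.loc[geo_codes_cpp_sel_years['time'] == 2017].set_index('geo')['cell_phones_total']
cells_2007 = geo_codes_cpp_sel_years.loc[geo_codes_cpp_sel_years['time'] == 2007].set_index('geo')['cell_phones_total']
geo_cell_increase = (cells_2017 - cells_2007).to_frame('Increase')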

In [0]:
first = pd.DataFrame()
second = pd.DataFrame()


In [0]:
first['first'] = (geo_codes_cpp_sel_years['cell_phones_total'].loc[geo_codes_cpp_sel_years['time'] == 2017])
second['second'] = (geo_codes_cpp_sel_years['cell_phones_total'].loc[geo_codes_cpp_sel_years['time'] == 2007])
#print((geo_codes_cpp_sel_years['cell_phones_total'].loc[geo_codes_cpp_sel_years['time'] == 2017]) - (geo_codes_cpp_sel_years['cell_phones_total'].loc[geo_codes_cpp_sel_years['time'] == 2007]))

In [0]:
#print(first)
#print(second)
diff = []
print(first['first'].iloc[0] - second['second'].iloc[0])
for x in range(0,5):
    diff.append(first['first'].iloc[x] - second['second'].iloc[x])
    

115508445.0


In [0]:
print(diff)

[115508445.0, 926791000.0, 365536321.0, 935282277.0, 146581000.0]


***After the pandas DataFrame subtraction methods failed me and I had to do further digging,
I can see that the country that had an increase of 935,282,277 cell phones from 2007 to 2017 is India.***
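
The bonus question also asks for the five countries sorted by increase. One possible sketch, assuming a wide table indexed by `geo` with one column per year: subtracting two columns aligns on the shared index, so no loop is needed. The 2007 and 2017 values are copied from the pivot table in Part 4 (the names `wide` and `increase` are made up for illustration):

```python
import pandas as pd

# 2007 and 2017 totals copied from the pivot table in Part 4
wide = pd.DataFrame(
    {
        2007: [120980103.0, 547306000.0, 93386881.0, 233620000.0, 249300000.0],
        2017: [236488548.0, 1474097000.0, 458923202.0, 1168902277.0, 395881000.0],
    },
    index=pd.Index(['BRA', 'CHN', 'IDN', 'IND', 'USA'], name='geo'),
)

# Column subtraction aligns on the geo index; sort from largest to smallest increase
increase = (wide[2017] - wide[2007]).sort_values(ascending=False)
print(increase)
```

The values match the `diff` list printed above, now labelled by country code and ordered by size.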

If you have the time and curiosity, what other questions can you ask and answer with this data?