# Data Science Unit 1 Sprint Challenge 2

## Data Wrangling

In this Sprint Challenge you will use data from [Gapminder](https://www.gapminder.org/about-gapminder/), a Swedish non-profit co-founded by Hans Rosling. "Gapminder produces free teaching resources making the world understandable based on reliable statistics."
- [Cell phones (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv)
- [Population (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv)
- [Geo country codes](https://github.com/open-numbers/ddf--gapminder--systema_globalis/blob/master/ddf--entities--geo--country.csv)

These two links have everything you need to successfully complete the Sprint Challenge!
- [Pandas documentation: Working with Text Data](https://pandas.pydata.org/pandas-docs/stable/text.html) (one question)
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) (everything else)

## Part 0. Load data

You don't need to add or change anything here. Just run this cell; it loads the data into three dataframes.

In [1]:
import pandas as pd

cell_phones = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv')

population = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv')

geo_country_codes = (pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--entities--geo--country.csv')
                       .rename(columns={'country': 'geo', 'name': 'country'}))

## Part 1. Join data

First, join the `cell_phones` and `population` dataframes (with an inner join on `geo` and `time`).

The resulting dataframe's shape should be: (8590, 4)

In [7]:
# Inner join on both keys; a list is the documented way to pass several join columns
df2 = pd.merge(cell_phones, population, how='inner', on=['geo', 'time'])
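As a minimal sketch of what an inner join on two keys does, the toy frames below (made-up values, not the Gapminder data) keep only the `geo`/`time` pairs present in both inputs. The names `phones`, `people`, and `joined` are hypothetical, and the `validate` argument is an optional safeguard rather than something the challenge asks for.

```python
import pandas as pd

# Toy stand-ins for cell_phones and population (values are made up for illustration)
phones = pd.DataFrame({
    'geo': ['aaa', 'aaa', 'bbb'],
    'time': [2000, 2001, 2000],
    'cell_phones_total': [10.0, 20.0, 30.0],
})
people = pd.DataFrame({
    'geo': ['aaa', 'bbb', 'bbb'],
    'time': [2000, 2000, 2001],
    'population_total': [100, 200, 210],
})

# Inner join: only (geo, time) pairs found in both frames survive.
# validate='one_to_one' raises a MergeError if either side has duplicate keys.
joined = pd.merge(phones, people, how='inner', on=['geo', 'time'],
                  validate='one_to_one')

print(joined.shape)
```

Rows that exist on only one side, such as `('aaa', 2001)` here, are dropped, which is why the joined frame can be shorter than either input.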

Then, select the `geo` and `country` columns from the `geo_country_codes` dataframe, and join with your population and cell phone data.

The resulting dataframe's shape should be: (8590, 5)

In [28]:
# Printing shape to confirm my answer
print ('df2 shape:', df2.shape)
print ('\n')

# Checking the head to make sure all looks right there
print (df2.head(1))
print ('\n')

# Making the next merge
df3 = pd.merge(df2, geo_country_codes[['geo', 'country']], on='geo', how='left')

# Checking answers again with head and shape
print ('df3 shape:', df3.shape)
df3.head()

df2 shape: (8590, 4)


   geo  time  cell_phones_total  population_total
0  afg  1960                0.0           8996351


df3 shape: (8590, 5)


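|   | geo | time | cell_phones_total | population_total |   country   |
|:-:|:---:|:----:|:-----------------:|:----------------:|:-----------:|
| 0 | afg | 1960 |               0.0 |          8996351 | Afghanistan |
| 1 | afg | 1965 |               0.0 |          9938414 | Afghanistan |
| 2 | afg | 1970 |               0.0 |         11126123 | Afghanistan |
| 3 | afg | 1975 |               0.0 |         12590286 | Afghanistan |
| 4 | afg | 1976 |               0.0 |         12840299 | Afghanistan |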
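A left join against a lookup table should leave the row count unchanged as long as each code appears only once in the lookup. The sketch below illustrates one way to check that, using toy data with hypothetical codes and names; `validate` and `indicator` are optional `pd.merge` arguments and are not required by the challenge.

```python
import pandas as pd

# Toy data (hypothetical codes and names, for illustration only)
facts = pd.DataFrame({'geo': ['aaa', 'aaa', 'bbb', 'zzz'],
                      'time': [2000, 2001, 2000, 2000]})
lookup = pd.DataFrame({'geo': ['aaa', 'bbb'],
                       'country': ['Country A', 'Country B'],
                       'extra': [1, 2]})  # a column we do not want to carry over

# Select only the needed lookup columns, then left-join on the shared key.
# validate='many_to_one' raises if a geo code is duplicated in the lookup.
# indicator=True adds a '_merge' column showing where each row found a match.
labeled = pd.merge(facts, lookup[['geo', 'country']], on='geo', how='left',
                   validate='many_to_one', indicator=True)

# Codes that had no match in the lookup table
unmatched = labeled.loc[labeled['_merge'] == 'left_only', 'geo'].tolist()
print(labeled.shape, unmatched)
```

Unmatched rows keep their place in the result with a missing `country`, so the shape alone does not reveal them; the `_merge` column does.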


## Part 2. Make features

Calculate the number of cell phones per person, and add this column onto your dataframe.

(You've calculated correctly if you get 1.220 cell phones per person in the United States in 2017.)

In [54]:
# Creating a new column with the ratio of 2 existing columns
df3['phones_per_person'] = (df3['cell_phones_total']/df3['population_total'])

# Creating a new df with only USA data to confirm my answer
usa = df3.loc[df3['country'] == 'United States', ['country', 'time', 'phones_per_person']]
usa2017 = usa.loc[usa['time'] == 2017]
print (usa2017)

            country  time  phones_per_person
8134  United States  2017           1.220125
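The same check can be sketched on a two-row frame built from the 2017 totals that appear in this notebook's outputs. The names `sample`, `mask`, and `usa_2017` are illustrative only; the approach of combining both filters into one boolean mask is an alternative to filtering in two steps.

```python
import pandas as pd

# 2017 totals for two countries, copied from the outputs in this notebook
sample = pd.DataFrame({
    'country': ['United States', 'China'],
    'time': [2017, 2017],
    'cell_phones_total': [395881000.0, 1474097000.0],
    'population_total': [324459463, 1409517397],
})

# Element-wise division of two columns gives the per-person ratio
sample['phones_per_person'] = sample['cell_phones_total'] / sample['population_total']

# Combine both conditions in one boolean mask and pull out the single value
mask = (sample['country'] == 'United States') & (sample['time'] == 2017)
usa_2017 = sample.loc[mask, 'phones_per_person'].iloc[0]
print(round(usa_2017, 3))
```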


Modify the `geo` column to make the geo codes uppercase instead of lowercase.

In [40]:
# Applying the uppercase method on the column 'geo'
print (df3['geo'].head())
df3['geo'] = df3['geo'].str.upper()
print (df3['geo'].head())

0    afg
1    afg
2    afg
3    afg
4    afg
Name: geo, dtype: object
0    AFG
1    AFG
2    AFG
3    AFG
4    AFG
Name: geo, dtype: object
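For reference, a small sketch of how the `.str` accessor behaves on a column that also contains a missing value. The series `codes` is a hypothetical example, not part of the challenge data.

```python
import pandas as pd

# Hypothetical geo codes, including one missing entry
codes = pd.Series(['afg', 'usa', None], name='geo')

# .str.upper() works element-wise; missing entries stay missing instead of raising
upper = codes.str.upper()
print(upper.tolist())
```

Calling Python's own `str.upper` on each element would fail on the missing entry, which is one reason to prefer the vectorized accessor.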


## Part 3. Process data

Use the `describe` function to describe your dataframe's numeric columns, and then its non-numeric columns.

(You'll see the time period ranges from 1960 to 2017, and there are 195 unique countries represented.)

In [59]:
# Printing the datatypes of each column to confirm my anticipated outputs
print ('datatypes for df3 columns:', '\n', '\n', df3.dtypes)
print ('\n')

# Creating separate dfs for numeric and non-numeric columns for clarity
numeric_features = df3[['time', 'cell_phones_total', 'population_total', 'phones_per_person']]
non_numeric_features = df3[['geo', 'country']]

# Printing answers 
print (numeric_features.describe())
print ('\n')
print (non_numeric_features.describe())
print ('\n')

datatypes for df3 columns: 
 
 geo                   object
time                   int64
cell_phones_total    float64
population_total       int64
country               object
phones_per_person    float64
dtype: object


              time  cell_phones_total  population_total  phones_per_person
count  8590.000000       8.590000e+03      8.590000e+03        8590.000000
mean   1994.193481       9.004950e+06      2.983823e+07           0.279639
std      14.257975       5.573408e+07      1.161284e+08           0.454247
min    1960.000000       0.000000e+00      4.433000e+03           0.000000
25%    1983.000000       0.000000e+00      1.456148e+06           0.000000
50%    1995.000000       6.200000e+03      5.725062e+06           0.001564
75%    2006.000000       1.697652e+06      1.810581e+07           0.461149
max    2017.000000       1.474097e+09      1.409517e+09           2.490243


         geo country
count   8590    8590
unique   195     195
top      FJI   India
freq      46      
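An alternative to splitting the dataframe by hand is to let `describe` select columns by dtype through its `include` and `exclude` arguments. The sketch below uses a few rows copied from this notebook's outputs; the frame name `sample` is illustrative.

```python
import numpy as np
import pandas as pd

# A few rows copied from the outputs in this notebook
sample = pd.DataFrame({
    'geo': ['CHN', 'IND', 'USA', 'USA'],
    'time': [2017, 2017, 2017, 2007],
    'cell_phones_total': [1474097000.0, 1168902277.0, 395881000.0, 249300000.0],
    'country': ['China', 'India', 'United States', 'United States'],
})

# Numeric columns only: count, mean, std, min, quartiles, max
numeric_summary = sample.describe(include=[np.number])

# Everything except numeric columns: count, unique, top, freq
text_summary = sample.describe(exclude=[np.number])

print(numeric_summary)
print(text_summary)
```

This avoids listing column names manually, so the summary stays correct if columns are added later.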

In 2017, what were the top 5 countries with the most cell phones total?

Your list of countries should have these totals:

| country | cell phones total |
|:-------:|:-----------------:|
|    ?    |     1,474,097,000 |
|    ?    |     1,168,902,277 |
|    ?    |       458,923,202 |
|    ?    |       395,881,000 |
|    ?    |       236,488,548 |



In [60]:
# This optional code formats float numbers with comma separators
pd.options.display.float_format = '{:,}'.format

In [88]:
# Creating a df with only 2017 as the time as it is the only relevant time for this question
df2017 = df3.loc[df3['time'] == 2017]

# Here I am actually creating some more time-related dfs for the bonus at the end
phone_lovers = df3.loc[df3['time'] >= 2007]
phone_loving_places = ('China', 'India', 'Indonesia', 'United States', 'Brazil')
phone_lovers = phone_lovers.loc[phone_lovers['country'].isin (phone_loving_places)]

# Sorting the df by the phone totals and checking my answer
df2017_sorted = df2017.sort_values(by=['cell_phones_total'], ascending=False)
df2017_sorted.head(5)
#phone_lovers.country.nunique() gives 5 (number of countries needed)

|      | geo | time | cell_phones_total | population_total |    country    |  phones_per_person |
|:----:|:---:|:----:|:-----------------:|:----------------:|:-------------:|:------------------:|
| 1496 | CHN | 2017 |      1474097000.0 |       1409517397 |     China     | 1.0458168186766978 |
| 3595 | IND | 2017 |      1168902277.0 |       1339180127 |     India     | 0.8728491809526382 |
| 3549 | IDN | 2017 |       458923202.0 |        263991379 |   Indonesia   |  1.738402230172827 |
| 8134 | USA | 2017 |       395881000.0 |        324459463 | United States | 1.2201246847283354 |
| 1084 | BRA | 2017 |       236488548.0 |        209288278 |    Brazil     | 1.1299655683535224 |
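`DataFrame.nlargest` is a shorter alternative to sorting and then taking the head. This sketch reuses the five 2017 totals shown above and asks for the top three; it is a different technique from the `sort_values` call in the cell, shown only for comparison.

```python
import pandas as pd

# 2017 totals copied from the output above, deliberately listed out of order
df2017 = pd.DataFrame({
    'country': ['Brazil', 'China', 'India', 'Indonesia', 'United States'],
    'cell_phones_total': [236488548.0, 1474097000.0, 1168902277.0,
                          458923202.0, 395881000.0],
})

# nlargest(n, column) returns the n rows with the largest values, already sorted
top3 = df2017.nlargest(3, 'cell_phones_total')
print(top3['country'].tolist())
```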


2017 was the first year that China had more cell phones than people.

What was the first year that the USA had more cell phones than people?

In [78]:
# Here I am defining a new df of years where the US had a phone to person ratio > 1
# ...which would obviously mean more phones than people
usa_more_phones = usa.loc[usa['phones_per_person'] > 1]
print ('These are all years where USA phone count has exceeded population:')
print (usa_more_phones)
print ('\n')
print ('This would be the first year it happened:')

# Here I am finding the first year it happened by taking the earliest year
# where the phone to person ratio is still > 1
print (usa_more_phones.loc[usa_more_phones['time'] == usa_more_phones['time'].min()])
print ('\n')

# This is unnecessary, but I like to visually check when possible
print ('Here is the whole subset of USA data to quickly, visually confirm my answer:')
print ('\n')
print (usa)

These are all years where USA phone count has exceeded population:
            country  time  phones_per_person
8131  United States  2014  1.118914031833164
8132  United States  2015 1.1949739048796058
8133  United States  2016  1.228758722948959
8134  United States  2017 1.2201246847283354


This would be the first year it happened:
            country  time  phones_per_person
8131  United States  2014  1.118914031833164


Here is the whole subset of USA data to quickly, visually confirm my answer:


            country  time     phones_per_person
8092  United States  1960                   0.0
8093  United States  1965                   0.0
8094  United States  1970                   0.0
8095  United States  1975                   0.0
8096  United States  1976                   0.0
8097  United States  1977                   0.0
8098  United States  1978                   0.0
8099  United States  1979                   0.0
8100  United States  1980                   0.0
8101  United S
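The "first year above 1" lookup can also be written as a single expression. The sketch below uses a subset of the USA ratios printed in this notebook (the 2007 value and 2014 through 2017); the intermediate years are omitted for brevity, so it only illustrates the technique.

```python
import pandas as pd

# A subset of the USA rows shown in this notebook (2008-2013 omitted for brevity)
usa = pd.DataFrame({
    'time': [2007, 2014, 2015, 2016, 2017],
    'phones_per_person': [0.8293546295279024, 1.118914031833164,
                          1.1949739048796058, 1.228758722948959,
                          1.2201246847283354],
})

# Keep the years where the ratio exceeds 1, then take the earliest one
first_year = usa.loc[usa['phones_per_person'] > 1, 'time'].min()
print(first_year)
```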

## Part 4. Reshape data

Create a pivot table:
- Columns: Years 2007 to 2017
- Rows: China, India, United States, Indonesia, Brazil (order doesn't matter)
- Values: Cell Phones Total

The table's shape should be: (5, 11)

In [93]:
# Straightforward pivot table creation, then printing the shape to check the answer
phone_lovers_pivot = phone_lovers.pivot_table(index='country', columns='time',
                                             values='cell_phones_total')

print ('Pivot table shape:', phone_lovers_pivot.shape)

Pivot table shape: (5, 11)
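To make the long-to-wide reshape concrete, here is a minimal sketch on four rows copied from this notebook's outputs (two countries, two years). The names `long_df` and `wide` are illustrative.

```python
import pandas as pd

# Long-format rows copied from the outputs in this notebook (two countries, two years)
long_df = pd.DataFrame({
    'country': ['China', 'China', 'India', 'India'],
    'time': [2007, 2017, 2007, 2017],
    'cell_phones_total': [547306000.0, 1474097000.0, 233620000.0, 1168902277.0],
})

# One row per country, one column per year. pivot_table aggregates duplicates
# with the mean by default; here each (country, time) pair is unique.
wide = long_df.pivot_table(index='country', columns='time',
                           values='cell_phones_total')
print(wide.shape)
```

Because every country and year pair occurs once, the default mean aggregation returns the original values unchanged.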


#### OPTIONAL BONUS QUESTION!

Sort these 5 countries by biggest increase in cell phones from 2007 to 2017.

Which country had 935,282,277 more cell phones in 2017 versus 2007?

In [137]:
# Defining phone totals for the relevant countries for the bottom of the relevant year range
phone_lovers_since_07 = phone_lovers.loc[phone_lovers['time'] == 2007]

# Defining phone totals for the relevant countries for the top end of the relevant year range
recent_phone_lovers = phone_lovers.loc[phone_lovers['time'] == 2017]

# Creating the df for my 10 year increase
more_phone_lovers_than_ever = phone_lovers_since_07

# Adjusting the data a bit
more_phone_lovers_than_ever = more_phone_lovers_than_ever.sort_values(by='country').reset_index()
phone_lovers_since_07 = phone_lovers_since_07.sort_values(by='country').reset_index()
recent_phone_lovers = recent_phone_lovers.sort_values(by='country').reset_index()

# Creating the column with the 10-yr increase 
more_phone_lovers_than_ever['10-yr nom. increase'] = (recent_phone_lovers['cell_phones_total'] - 
                                                      phone_lovers_since_07['cell_phones_total'])
# Sorting as specified
more_phone_lovers_than_ever = more_phone_lovers_than_ever.sort_values(
    by='10-yr nom. increase', ascending=False)
more_phone_lovers_than_ever = more_phone_lovers_than_ever.reset_index()

# Printing to check answers
print (more_phone_lovers_than_ever.head())
print ('\n')
print ('The country that gained 935,282,277 phones is', more_phone_lovers_than_ever.at[0,'country'])


# What questions can be answered: a lot. One that comes to mind is how emerging-market population growth
# relates to growth in phone counts. This could be pretty important for a company like Apple, which recently said
# trade tensions and an emerging-market slowdown have hurt its growth in China

   level_0  index  geo  time  cell_phones_total  population_total  \
0        2   3585  IND  2007      233,620,000.0        1179681239   
1        1   1486  CHN  2007      547,306,000.0        1336800506   
2        3   3539  IDN  2007       93,386,881.0         232989141   
3        4   8124  USA  2007      249,300,000.0         300595175   
4        0   1074  BRA  2007      120,980,103.0         191026637   

         country   phones_per_person  10-yr nom. increase  
0          India 0.19803654773558707        935,282,277.0  
1          China   0.409414865975522        926,791,000.0  
2      Indonesia 0.40082074468869777        365,536,321.0  
3  United States  0.8293546295279024        146,581,000.0  
4         Brazil  0.6333153580042348        115,508,445.0  


The country that gained 935,282,277 phones is India
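A shorter route to the same ranking is to work from a wide table indexed by country, where subtracting two year columns aligns rows automatically. The sketch below is an alternative under that assumption, not the method used in the cell above; it uses the 2007 and 2017 totals printed in this notebook, and the names `totals` and `increase` are illustrative.

```python
import pandas as pd

# 2007 and 2017 totals copied from the outputs in this notebook
totals = pd.DataFrame({
    'country': ['Brazil', 'China', 'India', 'Indonesia', 'United States'],
    2007: [120980103.0, 547306000.0, 233620000.0, 93386881.0, 249300000.0],
    2017: [236488548.0, 1474097000.0, 1168902277.0, 458923202.0, 395881000.0],
}).set_index('country')

# Subtracting two columns aligns on the country index, so no manual sorting
# or reset_index is needed before computing the difference.
increase = (totals[2017] - totals[2007]).sort_values(ascending=False)
print(increase)
```

The same subtraction works directly on a pivot table with years as columns, since it is already indexed by country.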


If you have the time and curiosity, what other questions can you ask and answer with this data?