<a href="https://colab.research.google.com/github/nickwinters1/DS-Unit-1-Sprint-2-Data-Wrangling/blob/master/DS_Unit_1_Sprint_Challenge_2_Data_Wrangling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science Unit 1 Sprint Challenge 2

## Data Wrangling

In this Sprint Challenge you will use data from [Gapminder](https://www.gapminder.org/about-gapminder/), a Swedish non-profit co-founded by Hans Rosling. "Gapminder produces free teaching resources making the world understandable based on reliable statistics."
- [Cell phones (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv)
- [Population (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv)
- [Geo country codes](https://github.com/open-numbers/ddf--gapminder--systema_globalis/blob/master/ddf--entities--geo--country.csv)

These two links have everything you need to successfully complete the Sprint Challenge!
- [Pandas documentation: Working with Text Data](https://pandas.pydata.org/pandas-docs/stable/text.html) (one question)
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) (everything else)

## Part 0. Load data

You don't need to add or change anything here. Just run this cell and it loads the data for you, into three dataframes.

In [0]:
import pandas as pd

cell_phones = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv')

population = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv')

geo_country_codes = (pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--entities--geo--country.csv')
                       .rename(columns={'country': 'geo', 'name': 'country'}))

## Part 1. Join data

First, join the `cell_phones` and `population` dataframes (with an inner join on `geo` and `time`).

The resulting dataframe's shape should be: (8590, 4)

In [0]:
result1 = pd.merge(cell_phones, population, how='inner', on=['geo', 'time'])

In [38]:
print ('result1 shape:', result1.shape)
print ('\n')
result1.head()

result1 shape: (8590, 4)




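|   | geo | time | cell_phones_total | population_total |
|:-:|:---:|:----:|:-----------------:|:----------------:|
| 0 | afg | 1960 | 0.0 | 8996351 |
| 1 | afg | 1965 | 0.0 | 9938414 |
| 2 | afg | 1970 | 0.0 | 11126123 |
| 3 | afg | 1975 | 0.0 | 12590286 |
| 4 | afg | 1976 | 0.0 | 12840299 |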
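
As a small illustration of what the inner join on `geo` and `time` does, here is a minimal sketch on toy data. The frames `phones` and `people` are hypothetical stand-ins (not the real Gapminder tables), and the Brazil population figure is made up. The `validate` argument is an optional safety check that each (`geo`, `time`) pair appears at most once on each side.

```python
import pandas as pd

# Hypothetical toy frames standing in for cell_phones and population
phones = pd.DataFrame({
    'geo': ['afg', 'afg', 'usa'],
    'time': [1960, 1965, 1960],
    'cell_phones_total': [0.0, 0.0, 0.0],
})
people = pd.DataFrame({
    'geo': ['afg', 'afg', 'bra'],
    'time': [1960, 1965, 1960],
    'population_total': [8996351, 9938414, 1000],  # the 'bra' value is made up
})

# An inner join keeps only (geo, time) pairs present in BOTH frames;
# validate raises MergeError if either side has duplicate key pairs
merged = pd.merge(phones, people, how='inner', on=['geo', 'time'],
                  validate='one_to_one')
print(merged.shape)
```

Only the two `afg` rows survive here, because `usa` is missing from `people` and `bra` is missing from `phones`.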


Then, select the `geo` and `country` columns from the `geo_country_codes` dataframe, and join with your population and cell phone data.

The resulting dataframe's shape should be: (8590, 5)

In [39]:
result2 = pd.merge(result1, geo_country_codes[['geo', 'country']], on='geo', how='left')
print ('result2 shape:', result2.shape)
result2.head()

result2 shape: (8590, 5)


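|   | geo | time | cell_phones_total | population_total | country |
|:-:|:---:|:----:|:-----------------:|:----------------:|:-------:|
| 0 | afg | 1960 | 0.0 | 8996351 | Afghanistan |
| 1 | afg | 1965 | 0.0 | 9938414 | Afghanistan |
| 2 | afg | 1970 | 0.0 | 11126123 | Afghanistan |
| 3 | afg | 1975 | 0.0 | 12590286 | Afghanistan |
| 4 | afg | 1976 | 0.0 | 12840299 | Afghanistan |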
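
A left join keeps every row of the left frame even when a `geo` code has no match, so it can be worth checking for unmatched codes. The sketch below uses hypothetical toy frames (`data`, `codes`, and the code `xyz` are made up) and the `indicator=True` option of `pd.merge`, which adds a `_merge` column describing where each row came from.

```python
import pandas as pd

# Hypothetical toy data: 'xyz' has no entry in the lookup table
data = pd.DataFrame({'geo': ['afg', 'afg', 'xyz'], 'time': [1960, 1965, 1960]})
codes = pd.DataFrame({'geo': ['afg', 'usa'],
                      'country': ['Afghanistan', 'United States']})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left join
checked = pd.merge(data, codes[['geo', 'country']], on='geo', how='left',
                   indicator=True)

# Codes from the left frame that found no country name
unmatched = checked.loc[checked['_merge'] == 'left_only', 'geo'].unique()
print(unmatched)
```

The row count stays at 3 because the lookup table has one row per code; unmatched rows simply get a missing `country`.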


## Part 2. Make features

Calculate the number of cell phones per person, and add this column onto your dataframe.

(You've calculated correctly if you get 1.220 cell phones per person in the United States in 2017.)

In [45]:
result2['phones_per_person'] = (result2['cell_phones_total']/result2['population_total'])
# Keep the United States rows in `usa`; this frame is reused in Part 3
usa = result2.loc[result2['country'] == 'United States', ['country', 'time', 'phones_per_person']]
usa_data_2017 = usa.loc[usa['time'] == 2017]
print (usa_data_2017)

            country  time  phones_per_person
8134  United States  2017           1.220125
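
As a quick sanity check of the ratio, the 2017 United States figures that appear in the Part 3 output (395,881,000 cell phones and 324,459,463 people) can be divided by hand. This is only a sketch of the arithmetic; the variable names are illustrative.

```python
import pandas as pd

# 2017 United States totals, taken from the outputs shown in this notebook
phones = pd.Series([395881000.0])
people = pd.Series([324459463])

# Element-wise division, the same operation used for the new column
ratio = (phones / people).iloc[0]
print(round(ratio, 3))
```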


Modify the `geo` column to make the geo codes uppercase instead of lowercase.

In [46]:
print (result2['geo'].head())
result2['geo'] = result2['geo'].str.upper()
print (result2['geo'].head())

0    afg
1    afg
2    afg
3    afg
4    afg
Name: geo, dtype: object
0    AFG
1    AFG
2    AFG
3    AFG
4    AFG
Name: geo, dtype: object
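
One detail of the `.str` accessor that may be useful: it skips missing values instead of raising an error. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Made-up codes with one missing value
codes = pd.Series(['afg', None, 'usa'])

# .str.upper() uppercases each string and leaves missing values missing
upper = codes.str.upper()
print(upper)
```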


## Part 3. Process data

Use the `describe` function to describe your dataframe's numeric columns, and then its non-numeric columns.

(You'll see the time period ranges from 1960 to 2017, and there are 195 unique countries represented.)

In [48]:
print ('datatypes for result2 columns:', '\n', '\n', result2.dtypes)
print ('\n')
numeric_features = result2[['time', 'cell_phones_total', 'population_total', 'phones_per_person']]
non_numeric_features = result2[['geo', 'country']]
print (numeric_features.describe())
print ('\n')
print (non_numeric_features.describe())
print ('\n')

datatypes for result2 columns: 
 
 geo                   object
time                   int64
cell_phones_total    float64
population_total       int64
country               object
phones_per_person    float64
dtype: object


              time  cell_phones_total  population_total  phones_per_person
count  8590.000000       8.590000e+03      8.590000e+03        8590.000000
mean   1994.193481       9.004950e+06      2.983823e+07           0.279639
std      14.257975       5.573408e+07      1.161284e+08           0.454247
min    1960.000000       0.000000e+00      4.433000e+03           0.000000
25%    1983.000000       0.000000e+00      1.456148e+06           0.000000
50%    1995.000000       6.200000e+03      5.725062e+06           0.001564
75%    2006.000000       1.697652e+06      1.810581e+07           0.461149
max    2017.000000       1.474097e+09      1.409517e+09           2.490243


         geo  country
count   8590     8590
unique   195      195
top      SDN  Jamaica
freq      
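
An alternative to listing the columns by hand is to let `describe` pick them by dtype with its `include` and `exclude` parameters. The sketch below assumes a small made-up frame `df`, not the real `result2`.

```python
import numpy as np
import pandas as pd

# Made-up frame with one text column and two numeric columns
df = pd.DataFrame({
    'geo': ['AFG', 'AFG', 'USA'],
    'time': [1960, 1965, 2017],
    'phones_per_person': [0.0, 0.0, 1.22],
})

# Numeric columns only: count, mean, std, min, quartiles, max
numeric_summary = df.describe(include=[np.number])

# Everything except numeric columns: count, unique, top, freq
text_summary = df.describe(exclude=[np.number])

print(numeric_summary)
print(text_summary)
```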

In 2017, what were the top 5 countries with the most cell phones total?

Your list of countries should have these totals:

| country | cell phones total |
|:-------:|:-----------------:|
|    ?    |     1,474,097,000 |
|    ?    |     1,168,902,277 |
|    ?    |       458,923,202 |
|    ?    |       395,881,000 |
|    ?    |       236,488,548 |



In [0]:
# This optional code formats float numbers with comma separators
pd.options.display.float_format = '{:,}'.format

In [53]:
# Rank the 2017 rows by total cell phones, largest first
result_2017 = result2.loc[result2['time'] == 2017]
result_2017_sorted = result_2017.sort_values(by=['cell_phones_total'], ascending=False)

# Keep the 2007-2017 rows for the top 5 countries (reused in Part 4)
most_phones_in_country = ['China', 'India', 'Indonesia', 'United States', 'Brazil']
most_phones = result2.loc[result2['time'] >= 2007]
most_phones = most_phones.loc[most_phones['country'].isin(most_phones_in_country)]

result_2017_sorted.head(5)

|      | geo | time | cell_phones_total | population_total | country | phones_per_person |
|:----:|:---:|:----:|:-----------------:|:----------------:|:-------:|:-----------------:|
| 1496 | CHN | 2017 | 1474097000.0 | 1409517397 | China | 1.0458168186766978 |
| 3595 | IND | 2017 | 1168902277.0 | 1339180127 | India | 0.8728491809526382 |
| 3549 | IDN | 2017 | 458923202.0 | 263991379 | Indonesia | 1.738402230172827 |
| 8134 | USA | 2017 | 395881000.0 | 324459463 | United States | 1.2201246847283354 |
| 1084 | BRA | 2017 | 236488548.0 | 209288278 | Brazil | 1.1299655683535224 |
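
Sorting and then taking `head` works; `DataFrame.nlargest` is another way to get the top rows in one call. The sketch below rebuilds a tiny frame from the five 2017 totals shown above (the name `phones_2017` is illustrative) and asks for the top three.

```python
import pandas as pd

# The five 2017 totals from the table above, in alphabetical order
phones_2017 = pd.DataFrame({
    'country': ['Brazil', 'China', 'India', 'Indonesia', 'United States'],
    'cell_phones_total': [236488548.0, 1474097000.0, 1168902277.0,
                          458923202.0, 395881000.0],
})

# nlargest returns the n rows with the largest values, already sorted
top3 = phones_2017.nlargest(3, 'cell_phones_total')
print(top3)
```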


2017 was the first year that China had more cell phones than people.

What was the first year that the USA had more cell phones than people?

In [55]:
usa_phones_exceeded_population = usa.loc[usa['phones_per_person'] > 1]
print ('All years phones exceeded population:')
print (usa_phones_exceeded_population)
print ('\n')
print ('First year phones exceeded population:')
print (usa_phones_exceeded_population.loc[usa_phones_exceeded_population['time'] == usa_phones_exceeded_population['time'].min()])
print ('\n')

All years phones exceeded population:
            country  time  phones_per_person
8131  United States  2014  1.118914031833164
8132  United States  2015 1.1949739048796058
8133  United States  2016  1.228758722948959
8134  United States  2017 1.2201246847283354


First year phones exceeded population:
            country  time  phones_per_person
8131  United States  2014  1.118914031833164
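
The same question can be answered more compactly by filtering and taking the minimum year directly. This is a sketch on a small frame rebuilt from United States values printed in this notebook (2007 and 2014 through 2017 only); the other years are omitted here, so it illustrates the technique rather than re-deriving the answer from the full data.

```python
import pandas as pd

# United States ratios copied from this notebook's outputs
usa_sample = pd.DataFrame({
    'time': [2007, 2014, 2015, 2016, 2017],
    'phones_per_person': [0.8293546295279024, 1.118914031833164,
                          1.1949739048796058, 1.228758722948959,
                          1.2201246847283354],
})

# Filter to years with more phones than people, then take the earliest
first_year = usa_sample.loc[usa_sample['phones_per_person'] > 1, 'time'].min()
print(first_year)
```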




## Part 4. Reshape data

Create a pivot table:
- Columns: Years 2007—2017
- Rows: China, India, United States, Indonesia, Brazil (order doesn't matter)
- Values: Cell Phones Total

The table's shape should be: (5, 11)

In [59]:
most_phones_pivot = most_phones.pivot_table(index='country', columns='time',
                                             values='cell_phones_total')
print ('Pivot Table Shape:', most_phones_pivot.shape)
most_phones_pivot.head()


Pivot Table Shape: (5, 11)


| country | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 |
|:-------:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
| Brazil | 120980103.0 | 150641403.0 | 169385584.0 | 196929978.0 | 234357507.0 | 248323703.0 | 271099799.0 | 280728796.0 | 257814274.0 | 244067356.0 | 236488548.0 |
| China | 547306000.0 | 641245000.0 | 747214000.0 | 859003000.0 | 986253000.0 | 1112155000.0 | 1229113000.0 | 1286093000.0 | 1291984200.0 | 1364934000.0 | 1474097000.0 |
| India | 233620000.0 | 346890000.0 | 525090000.0 | 752190000.0 | 893862478.0 | 864720917.0 | 886304245.0 | 944008677.0 | 1001056000.0 | 1127809000.0 | 1168902277.0 |
| Indonesia | 93386881.0 | 140578243.0 | 163676961.0 | 211290235.0 | 249805619.0 | 281963665.0 | 313226914.0 | 325582819.0 | 338948340.0 | 385573398.0 | 458923202.0 |
| United States | 249300000.0 | 261300000.0 | 274283000.0 | 285118000.0 | 297404000.0 | 304838000.0 | 310698000.0 | 355500000.0 | 382307000.0 | 395881000.0 | 395881000.0 |
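
Note that `pivot_table` aggregates with the mean by default, so duplicate (country, year) rows would be averaged silently. When each pair is expected to appear exactly once, `DataFrame.pivot` is a stricter alternative that raises an error on duplicates. A minimal sketch using four values from the table above (the name `long_form` is illustrative):

```python
import pandas as pd

# Long-form rows: one row per (country, year)
long_form = pd.DataFrame({
    'country': ['China', 'China', 'India', 'India'],
    'time': [2007, 2017, 2007, 2017],
    'cell_phones_total': [547306000.0, 1474097000.0,
                          233620000.0, 1168902277.0],
})

# pivot reshapes long to wide without aggregating;
# it raises ValueError if a (country, time) pair is duplicated
wide = long_form.pivot(index='country', columns='time',
                       values='cell_phones_total')
print(wide.shape)
```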


#### OPTIONAL BONUS QUESTION!

Sort these 5 countries, by biggest increase in cell phones from 2007 to 2017.

Which country had 935,282,277 more cell phones in 2017 versus 2007?

In [64]:
most_phones_since_07 = most_phones.loc[most_phones['time'] == 2007]
most_phones_recent = most_phones.loc[most_phones['time'] == 2017]
most_phones_increase_2007_to_2017 = most_phones_since_07
most_phones_increase_2007_to_2017 = most_phones_increase_2007_to_2017.sort_values(by='country').reset_index()
most_phones_since_07 = most_phones_since_07.sort_values(by='country').reset_index()
most_phones_recent = most_phones_recent.sort_values(by='country').reset_index()
most_phones_increase_2007_to_2017['10 year phone increase'] = (most_phones_recent['cell_phones_total'] - 
                                                      most_phones_since_07['cell_phones_total'])
most_phones_increase_2007_to_2017 = most_phones_increase_2007_to_2017.sort_values(
    by='10 year phone increase', ascending=False)
most_phones_increase_2007_to_2017 = most_phones_increase_2007_to_2017.reset_index()
print (most_phones_increase_2007_to_2017.head())
print ('\n')
print ('The country that had 935,282,277 more cell phones in 2017 is', most_phones_increase_2007_to_2017.at[0,'country'])

   level_0  index  geo  time  cell_phones_total  population_total  \
0        2   3585  IND  2007      233,620,000.0        1179681239   
1        1   1486  CHN  2007      547,306,000.0        1336800506   
2        3   3539  IDN  2007       93,386,881.0         232989141   
3        4   8124  USA  2007      249,300,000.0         300595175   
4        0   1074  BRA  2007      120,980,103.0         191026637   

         country   phones_per_person  10 year phone increase  
0          India 0.19803654773558707           935,282,277.0  
1          China   0.409414865975522           926,791,000.0  
2      Indonesia 0.40082074468869777           365,536,321.0  
3  United States  0.8293546295279024           146,581,000.0  
4         Brazil  0.6333153580042348           115,508,445.0  


The country that had 935,282,277 more cell phones in 2017 is India
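
Once the data is in wide form, the ten-year increase is a single column subtraction. The sketch below builds a small wide frame from the 2007 and 2017 values shown in the pivot table above (the name `wide` is illustrative and is not the `most_phones_pivot` object itself).

```python
import pandas as pd

# 2007 and 2017 totals per country, copied from the pivot table above
wide = pd.DataFrame(
    {2007: [120980103.0, 547306000.0, 233620000.0, 93386881.0, 249300000.0],
     2017: [236488548.0, 1474097000.0, 1168902277.0, 458923202.0, 395881000.0]},
    index=pd.Index(['Brazil', 'China', 'India', 'Indonesia', 'United States'],
                   name='country'),
)

# Subtract the two year columns (aligned by country), then sort descending
increase = (wide[2017] - wide[2007]).sort_values(ascending=False)
print(increase)
```

Because the subtraction aligns on the `country` index, there is no need to sort and reset indexes before computing the difference.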


If you have the time and curiosity, what other questions can you ask and answer with this data?

I would love to go deeper with this data by building forecasts of the number of cell phones, per person and in total, by the year 2020. It is amazing how far we have come, and how easily accessible technology is in our world. I think it would also be interesting to figure out the average amount each person spends on cell phones, both per country and worldwide. Comparing data in any way possible is very satisfying to me; I really enjoy it! For example, last week all I really knew about cancer was that it's scary, and within a three-hour sprint challenge I had learned the possible survival rates of different breast cancer patients. Data science is incredible to me, and a very useful tool for learning amazing new things!