<a href="https://colab.research.google.com/github/extrajp2014/DS-Unit-1-Sprint-2-Data-Wrangling/blob/master/DS_Unit_1_Sprint_Challenge_2_Data_Wrangling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science Unit 1 Sprint Challenge 2

## Data Wrangling

In this Sprint Challenge you will use data from [Gapminder](https://www.gapminder.org/about-gapminder/), a Swedish non-profit co-founded by Hans Rosling. "Gapminder produces free teaching resources making the world understandable based on reliable statistics."
- [Cell phones (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv)
- [Population (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv)
- [Geo country codes](https://github.com/open-numbers/ddf--gapminder--systema_globalis/blob/master/ddf--entities--geo--country.csv)

These two links have everything you need to successfully complete the Sprint Challenge!
- [Pandas documentation: Working with Text Data](https://pandas.pydata.org/pandas-docs/stable/text.html) (one question)
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) (everything else)

## Part 0. Load data

You don't need to add or change anything here. Just run this cell and it loads the data for you, into three dataframes.

In [0]:
import pandas as pd

cell_phones = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv')

population = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv')

geo_country_codes = (pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--entities--geo--country.csv')
                       .rename(columns={'country': 'geo', 'name': 'country'}))

## Part 1. Join data

First, join the `cell_phones` and `population` dataframes (with an inner join on `geo` and `time`).

The resulting dataframe's shape should be: (8590, 4)

In [2]:
df1=cell_phones
df2=population
df3=geo_country_codes
print(cell_phones.head())
print(population.head())

   geo  time  cell_phones_total
0  abw  1960                0.0
1  abw  1965                0.0
2  abw  1970                0.0
3  abw  1975                0.0
4  abw  1976                0.0
   geo  time  population_total
0  afg  1800           3280000
1  afg  1801           3280000
2  afg  1802           3280000
3  afg  1803           3280000
4  afg  1804           3280000


In [3]:
# Join data
# df1=cell_phones
# df2=population
# df3=geo_country_codes
merged=df1.merge(df2, on=['geo', 'time'], how='inner')
print(merged.shape)
merged.head()

(8590, 4)


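|   | geo | time | cell_phones_total | population_total |
|--:|:---:|-----:|------------------:|-----------------:|
| 0 | afg | 1960 | 0.0 | 8996351 |
| 1 | afg | 1965 | 0.0 | 9938414 |
| 2 | afg | 1970 | 0.0 | 11126123 |
| 3 | afg | 1975 | 0.0 | 12590286 |
| 4 | afg | 1976 | 0.0 | 12840299 |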
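
As a minimal sketch of what an inner join on `geo` and `time` does, the toy frames below (made-up values, not Gapminder data) share only some key pairs. Running an outer join with `indicator=True` alongside it is one way to see which rows the inner join would discard.

```python
import pandas as pd

# Toy stand-ins for cell_phones and population (values are made up).
phones = pd.DataFrame({
    'geo': ['aaa', 'aaa', 'bbb'],
    'time': [2000, 2001, 2000],
    'cell_phones_total': [10.0, 20.0, 5.0],
})
people = pd.DataFrame({
    'geo': ['aaa', 'aaa', 'ccc'],
    'time': [2000, 2001, 2000],
    'population_total': [100, 110, 50],
})

# Inner join keeps only (geo, time) pairs present in both frames.
inner = phones.merge(people, on=['geo', 'time'], how='inner')

# Outer join with indicator=True labels where each row came from,
# which exposes the rows the inner join left out.
audit = phones.merge(people, on=['geo', 'time'], how='outer', indicator=True)
dropped = audit[audit['_merge'] != 'both']

print(inner.shape)
print(dropped[['geo', 'time', '_merge']])
```

Here `bbb` exists only in the left frame and `ccc` only in the right, so neither survives the inner join.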


Then, select the `geo` and `country` columns from the `geo_country_codes` dataframe, and join with your population and cell phone data.

The resulting dataframe's shape should be: (8590, 5)

In [4]:
# df1=cell_phones
# df2=population
# df3=geo_country_codes
columns = ['geo', 'country']
final=merged.merge(df3[columns], on='geo', how='inner')
print(final.shape)
final.head()

(8590, 5)


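|   | geo | time | cell_phones_total | population_total | country |
|--:|:---:|-----:|------------------:|-----------------:|:--------|
| 0 | afg | 1960 | 0.0 | 8996351 | Afghanistan |
| 1 | afg | 1965 | 0.0 | 9938414 | Afghanistan |
| 2 | afg | 1970 | 0.0 | 11126123 | Afghanistan |
| 3 | afg | 1975 | 0.0 | 12590286 | Afghanistan |
| 4 | afg | 1976 | 0.0 | 12840299 | Afghanistan |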
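
Joining a lookup table like the country codes is a many-to-one merge: many rows per `geo` on the left, one row per `geo` on the right. A small sketch with hypothetical toy frames shows how `validate='many_to_one'` can guard against a duplicated code silently multiplying rows.

```python
import pandas as pd

# Toy stand-ins (hypothetical values): several rows per geo on the left,
# exactly one row per geo in the lookup table on the right.
merged_toy = pd.DataFrame({
    'geo': ['aaa', 'aaa', 'bbb'],
    'time': [2000, 2001, 2000],
    'cell_phones_total': [10.0, 20.0, 5.0],
})
codes_toy = pd.DataFrame({
    'geo': ['aaa', 'bbb'],
    'country': ['Country A', 'Country B'],
    'extra': ['x', 'y'],  # a column we do not want to carry over
})

# Select only the needed columns before merging; pandas raises MergeError
# if any geo code appears more than once in the lookup table.
named = merged_toy.merge(codes_toy[['geo', 'country']], on='geo',
                         how='inner', validate='many_to_one')
print(named.shape)
```

The row count stays at 3 and only one column is added, mirroring the move from (8590, 4) to (8590, 5) in the challenge.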


In [38]:
df=final
import numpy as np
def all_numeric(df):
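  # Comparing dtypes to np.number with == is unreliable; use pandas' dtype check (bool counts as numeric)
  return all(pd.api.types.is_numeric_dtype(dtype) for dtype in df.dtypes)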

def no_nulls(df):
  return not any(df.isnull().sum())

def ready_for_sklearn(df):
  return all_numeric(df) and no_nulls(df)

print(all_numeric(df), ready_for_sklearn(df), no_nulls(df),final.isnull().sum().sum())


False False True 0


## Part 2. Make features

Calculate the number of cell phones per person, and add this column onto your dataframe.

(You've calculated correctly if you get 1.220 cell phones per person in the United States in 2017.)

In [5]:
final.dtypes


geo                   object
time                   int64
cell_phones_total    float64
population_total       int64
country               object
dtype: object

In [0]:
# df=final
# df1=cell_phones
# df2=population
# df3=geo_country_codes
# Calculate the number of cell phones per person, and add this column onto your dataframe.
# (You've calculated correctly if you get 1.220 cell phones per person in the United States in 2017.)

final['Cell Phones Per Person'] = final.cell_phones_total / final.population_total

In [97]:
pd.set_option('display.max_rows', 500)
final.sort_values(by='country').tail(400).head(10)

|      | geo | time | cell_phones_total | population_total | country | Cell Phones Per Person |
|-----:|:---:|-----:|------------------:|-----------------:|:--------|-----------------------:|
| 8130 | USA | 2013 | 310698000.0 | 315536676 | United States | 0.9846652501340288 |
| 8120 | USA | 2003 | 160637000.0 | 290027624 | United States | 0.5538679308699229 |
| 8132 | USA | 2015 | 382307000.0 | 319929162 | United States | 1.1949739048796058 |
| 8133 | USA | 2016 | 395881000.0 | 322179605 | United States | 1.228758722948959 |
| 8134 | USA | 2017 | 395881000.0 | 324459463 | United States | 1.2201246847283354 |
| 8128 | USA | 2011 | 297404000.0 | 311051373 | United States | 0.9561250192584748 |
| 8118 | USA | 2001 | 128500000.0 | 284852391 | United States | 0.4511108351553208 |
| 8092 | USA | 1960 | 0.0 | 186808228 | United States | 0.0 |
| 8094 | USA | 1970 | 0.0 | 209588150 | United States | 0.0 |
| 8116 | USA | 1999 | 86047003.0 | 278862277 | United States | 0.3085645140880779 |
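
Rather than scrolling through sorted rows, the 1.220 check can be done with a boolean mask. The sketch below rebuilds just the 2016 and 2017 United States rows from the output above and looks up the 2017 ratio directly.

```python
import pandas as pd

# Two United States rows copied from the output above.
usa = pd.DataFrame({
    'country': ['United States', 'United States'],
    'time': [2016, 2017],
    'cell_phones_total': [395881000.0, 395881000.0],
    'population_total': [322179605, 324459463],
})
usa['Cell Phones Per Person'] = usa['cell_phones_total'] / usa['population_total']

# The mask selects the single row to verify.
mask = (usa['country'] == 'United States') & (usa['time'] == 2017)
value = float(usa.loc[mask, 'Cell Phones Per Person'].iloc[0])
print(round(value, 3))  # 1.22
```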


Modify the `geo` column to make the geo codes uppercase instead of lowercase.

In [32]:
final['geo'] = final['geo'].str.upper()
final.head()


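|   | geo | time | cell_phones_total | population_total | country | Cell Phones Per Person |
|--:|:---:|-----:|------------------:|-----------------:|:--------|-----------------------:|
| 0 | AFG | 1960 | 0.0 | 8996351 | Afghanistan | 0.0 |
| 1 | AFG | 1965 | 0.0 | 9938414 | Afghanistan | 0.0 |
| 2 | AFG | 1970 | 0.0 | 11126123 | Afghanistan | 0.0 |
| 3 | AFG | 1975 | 0.0 | 12590286 | Afghanistan | 0.0 |
| 4 | AFG | 1976 | 0.0 | 12840299 | Afghanistan | 0.0 |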
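
The `.str` accessor applies string methods to a whole column at once. A tiny sketch on a hypothetical sample, including a missing value, shows that the missing entry stays missing instead of raising an error.

```python
import pandas as pd

# Hypothetical sample with one missing code.
toy = pd.DataFrame({'geo': ['afg', 'usa', None]})

# .str.upper() is vectorized; missing values are passed through as missing.
toy['geo'] = toy['geo'].str.upper()
print(toy)
```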


## Part 3. Process data

Use the `describe` function to summarize your dataframe's numeric columns, and then its non-numeric columns.

(You'll see the time period ranges from 1960 to 2017, and there are 195 unique countries represented.)

In [66]:
# This optional code formats float numbers with comma separators
pd.options.display.float_format = '{:,}'.format
# geo                        object
# time                        int64
# cell_phones_total         float64
# population_total            int64
# country                    object
# Cell Phones Per Person    float64
# dtype: object
final.dtypes

pd.set_option('display.max_columns', 500)
print(final.describe())
print("\n")
print(final.geo.describe())
print("\n")
print(final.country.describe())

                     time    cell_phones_total     population_total  Cell Phones Per Person
count             8,590.0              8,590.0              8,590.0                 8,590.0
mean  1,994.1934807916182  9,004,949.642905472 29,838,230.581722934     0.27963855580591535
std    14.257974607310278 55,734,084.872176506 116,128,377.47477297     0.45424665621404714
min               1,960.0                  0.0              4,433.0                     0.0
25%               1,983.0                  0.0          1,456,148.0                     0.0
50%               1,995.0              6,200.0          5,725,062.5   0.0015636266438163813
75%               2,006.0          1,697,652.0         18,105,812.0      0.4611491855201403
max               2,017.0      1,474,097,000.0      1,409,517,397.0       2.490242818521353


count     8590
unique     195
top        CAN
freq        46
Name: geo, dtype: object


count        8590
unique        195
top       Ukraine
freq           46
Name: country, dtype: object
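
An alternative to describing each text column separately is to pass `include` or `exclude` to `describe`. This is a sketch on a hypothetical toy frame, not the challenge data.

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric and one text column (hypothetical values).
toy = pd.DataFrame({
    'time': [1960, 1990, 2017],
    'country': ['A', 'B', 'B'],
})

# Numeric columns: count, mean, std, min, quartiles, max.
numeric_summary = toy.describe(include=[np.number])
# Everything that is not numeric: count, unique, top, freq.
text_summary = toy.describe(exclude=[np.number])

print(numeric_summary)
print(text_summary)
```

The `unique` row of the second summary is where the count of distinct countries would appear for the full dataframe.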

In 2017, what were the top 5 countries with the most cell phones total?

Your list of countries should have these totals:

| country | cell phones total |
|:-------:|:-----------------:|
|    ?    |     1,474,097,000 |
|    ?    |     1,168,902,277 |
|    ?    |       458,923,202 |
|    ?    |       395,881,000 |
|    ?    |       236,488,548 |



In [67]:
# Restrict to 2017 before ranking; sorting every year would rank each country's all-time peak instead.
top = (final.loc[final.time == 2017, ['country', 'cell_phones_total']]
            .sort_values('cell_phones_total', ascending=False)
            .head(5))
print(top.to_string(index=False))

      country  cell_phones_total
        China    1,474,097,000.0
        India    1,168,902,277.0
    Indonesia      458,923,202.0
United States      395,881,000.0
       Brazil      236,488,548.0


2017 was the first year that China had more cell phones than people.

What was the first year that the USA had more cell phones than people?

In [86]:
condition = (final.country == 'United States') & (final.cell_phones_total > final.population_total)
temp=final.loc[condition, 'time']
print(type(temp))
print("\n")
print(temp.head())
print("\n")
print("first year that the USA had more cell phones than people")
print(temp.head(1))


<class 'pandas.core.series.Series'>


8131    2014
8132    2015
8133    2016
8134    2017
Name: time, dtype: int64


first year that the USA had more cell phones than people
8131    2014
Name: time, dtype: int64
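
Taking `head(1)` works because the rows happen to be in chronological order. A sketch on hypothetical toy data shows that `.min()` on the filtered years gives the same kind of answer as a plain number, regardless of row order.

```python
import pandas as pd

# Hypothetical toy data: phones overtake population from 2002 onward.
toy = pd.DataFrame({
    'country': ['A'] * 4,
    'time': [2003, 2001, 2002, 2000],  # deliberately unsorted
    'cell_phones_total': [130.0, 90.0, 105.0, 80.0],
    'population_total': [103, 101, 102, 100],
})

more_phones = (toy['country'] == 'A') & (toy['cell_phones_total'] > toy['population_total'])

# .min() returns the earliest matching year as a scalar.
first_year = toy.loc[more_phones, 'time'].min()
print(first_year)
```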


## Part 4. Reshape data

Create a pivot table:
- Columns: Years 2007—2017
- Rows: China, India, United States, Indonesia, Brazil (order doesn't matter)
- Values: Cell Phones Total

The table's shape should be: (5, 11)

In [95]:
# Current Variables
# ['pd', 'population', 'df1', 'df2', 'df3', 'merged', 'columns', 'final', 'temp'
#     , 'np', 'df', 'top', 'condition']
# Current df column names
# ['geo', 'time', 'cell_phones_total', 'population_total', 'country', 
#  'Cell Phones Per Person']

final2=final
final2.pivot_table(index='country', 
                   columns='time', 
                   values='cell_phones_total').head()


country,1960,1965,1970,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
Afghanistan,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,25000.0,200000.0,600000.0,1200000.0,2520366.0,4668096.0,7898909.0,10500000.0,10215840.0,13797879.0,15340115.0,16807156.0,18407168.0,19709038.0,21602982.0,23929713.0
Albania,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2300.0,3300.0,5600.0,11008.0,29791.0,392650.0,851000.0,1100000.0,1259590.0,1530244.0,1909885.0,2322436.0,1859632.0,2463741.0,2692372.0,3100000.0,3500000.0,3685983.0,3359654.0,3400955.0,3369756.0,3497950.0
Algeria,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,470.0,4781.0,4781.0,4781.0,1348.0,4691.0,11700.0,17400.0,18000.0,72000.0,86000.0,100000.0,450244.0,1446927.0,4882414.0,13661355.0,20997954.0,27562721.0,27031472.0,32729824.0,32780165.0,35615926.0,37527703.0,39517045.0,43298174.0,43227643.0,47041321.0,49873389.0
Andorra,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,770.0,780.0,784.0,2825.0,5488.0,8618.0,14117.0,20600.0,23543.0,29429.0,32790.0,51893.0,58366.0,64560.0,69004.0,63503.0,64202.0,64549.0,65495.0,65044.0,63865.0,63931.0,66241.0,71336.0,76132.0,80337.0
Angola,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1100.0,1824.0,1994.0,3298.0,7052.0,9820.0,24000.0,25806.0,75000.0,140000.0,350000.0,740000.0,1611118.0,3054620.0,4961536.0,6773356.0,8109421.0,9403365.0,12073218.0,12785109.0,13285198.0,14052558.0,13884532.0,13001124.0,13323952.0
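
The pivot above covers every country and every year, so it does not have the (5, 11) shape the prompt asks for. One way to get there is to filter the rows first and pivot afterwards. The sketch below uses hypothetical toy data with stand-in country names and a shorter year range.

```python
import pandas as pd

# Hypothetical long-format toy data: 3 countries x 4 years.
toy = pd.DataFrame({
    'country': ['A'] * 4 + ['B'] * 4 + ['C'] * 4,
    'time': [2005, 2006, 2007, 2008] * 3,
    'cell_phones_total': [float(n) for n in range(12)],
})

countries = ['A', 'B']  # stand-in for the five countries in the prompt

# Keep only the wanted countries and years (between is inclusive on both ends).
subset = toy[toy['country'].isin(countries) & toy['time'].between(2006, 2008)]

table = subset.pivot_table(index='country', columns='time',
                           values='cell_phones_total')
print(table.shape)  # (2, 3)
```

With the real dataframe, the same pattern with the five country names and `between(2007, 2017)` should yield five rows and eleven year columns.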


#### OPTIONAL BONUS QUESTION!

Sort these 5 countries by biggest increase in cell phones from 2007 to 2017.

Which country had 935,282,277 more cell phones in 2017 versus 2007?

In [0]:
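# Difference between the 2017 and 2007 totals for the five countries, largest first
top5 = ['China', 'India', 'United States', 'Indonesia', 'Brazil']
wide = final2[final2.country.isin(top5)].pivot_table(index='country', columns='time', values='cell_phones_total')
(wide[2017] - wide[2007]).sort_values(ascending=False)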


If you have the time and curiosity, what other questions can you ask and answer with this data?