# Week 5 Day 1 - Pandas

[Pandas](https://pandas.pydata.org) is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license.

In [434]:
import pandas as pd

In [435]:
# make a dictionary with lists as values
mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

In [436]:
mydataset

{'cars': ['BMW', 'Volvo', 'Ford'], 'passings': [3, 7, 2]}

In [437]:
#make it a dataframe
mycars = pd.DataFrame(mydataset)
mycars


| | cars | passings |
|---|---|---|
| 0 | BMW | 3 |
| 1 | Volvo | 7 |
| 2 | Ford | 2 |


In [438]:
#get the information of your dataframe

mycars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   cars      3 non-null      object
 1   passings  3 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 180.0+ bytes


In [439]:
#get the shape of your dataframe

mycars.shape

(3, 2)

In [440]:
#get the columns

mycars.columns

Index(['cars', 'passings'], dtype='object')

In [441]:
#get the rows as a NumPy array

mycars.values

array([['BMW', 3],
       ['Volvo', 7],
       ['Ford', 2]], dtype=object)

In [442]:
#get the axes (row index and column labels)
mycars.axes


[RangeIndex(start=0, stop=3, step=1),
 Index(['cars', 'passings'], dtype='object')]

In [443]:
#get the first row
mycars.loc[0]

cars        BMW
passings      3
Name: 0, dtype: object

In [444]:
mycars.loc[2]


cars        Ford
passings       2
Name: 2, dtype: object

In [445]:
type(mycars.loc[2])

pandas.core.series.Series

In [446]:
#use a list of indexes:
mycars.loc[[0, 2]]

| | cars | passings |
|---|---|---|
| 0 | BMW | 3 |
| 2 | Ford | 2 |
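`.loc` selects rows by index label, not by position. The two only coincide while the index is the default 0, 1, 2, ... range. The sketch below rebuilds the same small cars table and drops one row to show the difference, using `.iloc` for position-based access; the variable names are illustrative only.

```python
import pandas as pd

cars = pd.DataFrame({"cars": ["BMW", "Volvo", "Ford"], "passings": [3, 7, 2]})

# Drop the middle row so the labels (0, 2) no longer match the positions (0, 1).
subset = cars.drop(1)

by_label = subset.loc[2]      # the row whose index label is 2
by_position = subset.iloc[1]  # the second row, whatever its label is

print(by_label["cars"], by_position["cars"])
```

Here both lookups return the Ford row, while `subset.loc[1]` would raise a `KeyError` because label 1 no longer exists.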


In [447]:
#get a column


mycars['cars']

0      BMW
1    Volvo
2     Ford
Name: cars, dtype: object

In [448]:
#get several rows by passing a list of index labels

mycars.loc[[0, 1, 2]]


| | cars | passings |
|---|---|---|
| 0 | BMW | 3 |
| 1 | Volvo | 7 |
| 2 | Ford | 2 |


In [449]:
#what are the datatypes of the dataframe

mycars.dtypes

cars        object
passings     int64
dtype: object

In [450]:
#change 'passings' to float

mycars['passings'].astype('float')

0    3.0
1    7.0
2    2.0
Name: passings, dtype: float64

In [451]:
#look at it again: astype returned a new Series, so the dataframe itself is unchanged

mycars.dtypes


cars        object
passings     int64
dtype: object

In [452]:
mycars['passings'] = mycars['passings'].astype('float')

In [453]:
mycars.dtypes

cars         object
passings    float64
dtype: object

In [454]:
#get rid of the last row

mycars.drop(2)

| | cars | passings |
|---|---|---|
| 0 | BMW | 3.0 |
| 1 | Volvo | 7.0 |


In [455]:
mycars

| | cars | passings |
|---|---|---|
| 0 | BMW | 3.0 |
| 1 | Volvo | 7.0 |
| 2 | Ford | 2.0 |


In [456]:
newDf = mycars.drop(2)

In [457]:
newDf

| | cars | passings |
|---|---|---|
| 0 | BMW | 3.0 |
| 1 | Volvo | 7.0 |


In [458]:
#get rid of the passings column

newDf2 = newDf.drop('passings', axis = 'columns')

In [459]:
newDf2

| | cars |
|---|---|
| 0 | BMW |
| 1 | Volvo |
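As the cells above show, `drop` returns a new dataframe and leaves the original untouched unless the result is assigned. A compact alternative, sketched here on a fresh copy of the cars table, is to name the axis through the `index=` and `columns=` keywords instead of `axis`.

```python
import pandas as pd

cars = pd.DataFrame({"cars": ["BMW", "Volvo", "Ford"], "passings": [3, 7, 2]})

# Each drop returns a new dataframe, so the calls can be chained.
trimmed = cars.drop(index=2).drop(columns="passings")

print(trimmed)
print(cars.shape)  # the original still has all rows and columns
```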



### NaN & empty data

In [460]:
import numpy as np

uglyData = {
  'cars': ["BMW", 'Jeep', "Ford", 'Chrysler'],
  'passings': [3, np.nan, 2, 'NaN']
}

uglyDF = pd.DataFrame(uglyData)

In [461]:
uglyDF

| | cars | passings |
|---|---|---|
| 0 | BMW | 3.0 |
| 1 | Jeep | NaN |
| 2 | Ford | 2.0 |
| 3 | Chrysler | NaN |


In [462]:
#drop the rows that contain NaN

uglyDF_clean = uglyDF.dropna()

In [463]:
uglyDF_clean

| | cars | passings |
|---|---|---|
| 0 | BMW | 3.0 |
| 2 | Ford | 2.0 |
| 3 | Chrysler | NaN |


In [464]:
#replace NaN with 0

uglyDF_clean2 = uglyDF.fillna(0)
uglyDF_clean2

| | cars | passings |
|---|---|---|
| 0 | BMW | 3.0 |
| 1 | Jeep | 0.0 |
| 2 | Ford | 2.0 |
| 3 | Chrysler | NaN |
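The Chrysler row survives both `dropna()` and `fillna(0)` because its value is the string `'NaN'`, not a real missing value like `np.nan`. One possible fix, sketched below on a rebuilt copy of the data, is to convert the column with `pd.to_numeric(..., errors="coerce")` first so that the string becomes a genuine NaN.

```python
import numpy as np
import pandas as pd

ugly = pd.DataFrame({
    "cars": ["BMW", "Jeep", "Ford", "Chrysler"],
    "passings": [3, np.nan, 2, "NaN"],
})

# The string "NaN" is ordinary text, so only the Jeep row counts as missing.
missing_before = int(ugly["passings"].isna().sum())

# Convert the column to numbers; the string ends up as a real NaN.
ugly["passings"] = pd.to_numeric(ugly["passings"], errors="coerce")
missing_after = int(ugly["passings"].isna().sum())

cleaned = ugly.dropna()
print(missing_before, missing_after)
print(cleaned)
```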



### .csv Files

**pd.read_csv** 

A simple way to store large data sets is to use CSV (comma-separated values) files.
CSV files contain plain text in a well-known format that most tools, including Pandas, can read.

*pd.read_csv(filepath_or_buffer, sep=',', header='infer', index_col=None, usecols=None, engine=None, skiprows=None, nrows=None)*
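To see what a few of these parameters do without needing a file on disk, the sketch below reads a small made-up CSV string through `io.StringIO`. The data and the semicolon separator are assumptions chosen purely for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV text that uses ";" instead of "," as the separator.
csv_text = "title;rating;votes\nShow A;9.5;100\nShow B;8.7;50\nShow C;7.1;20\n"

shows = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",                     # column separator
    usecols=["title", "rating"], # load only these columns
    index_col="title",           # use this column as the row index
    nrows=2,                     # read only the first two data rows
)

print(shows)
```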

In [465]:
#import tv_shows.csv

df = pd.read_csv("tv_shows.csv")

In [466]:
#get a preview

df


| | Unnamed: 0 | title | year | runtime | rating | votes | genre | text |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Game of Thrones | (2011 TV Series) | 55 mins. | 9.5 | 748557 | [u'Adventure', u'Drama', u'Fantasy'] | Several noble families fight for control of th... |
| 1 | 1 | Breaking Bad | (2008 TV Series) | 45 mins. | 9.5 | 662459 | [u'Crime', u'Drama', u'Thriller'] | A chemistry teacher diagnosed with a terminal ... |
| 2 | 2 | The Walking Dead | (2010 TV Series) | 44 mins. | 8.7 | 500301 | [u'Drama', u'Horror'] | Sheriff's Deputy Rick Grimes leads a group of ... |
| 3 | 3 | The Big Bang Theory | (2007 TV Series) | 22 mins. | 8.5 | 438226 | [u'Comedy'] | A woman who moves into an apartment across the... |
| 4 | 4 | Dexter | (2006 TV Series) | 55 mins. | 8.9 | 419031 | [u'Crime', u'Drama', u'Mystery', u'Thriller'] | A Miami police forensics expert moonlights as ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1969 | 1969 | Cristela | (2014 TV Series) | 30 mins. | 6.3 | 1510 | [u'Comedy'] | In her sixth year of law school, Cristela is f... |
| 1970 | 1970 | Power Rangers in Space | (1998 TV Series) | 30 mins. | 7.2 | 1504 | [u'Action', u'Adventure', u'Family', u'Sci-Fi'] | The most evil forces of the universe (Rita &am... |
| 1971 | 1971 | Reading Rainbow | (1983 TV Series) | 30 mins. | 8.5 | 1501 | [u'Family'] | Levar Burton introduces young viewers to illus... |
| 1972 | 1972 | Martial Law | (1998 TV Series) | 45 mins. | 7.1 | 1501 | [u'Comedy', u'Crime', u'Action'] | A Shanghai cop who is a master of martial arts... |


In [467]:
#what are the columns of the csv file?

df.columns

Index(['Unnamed: 0', 'title', 'year', 'runtime', 'rating', 'votes', 'genre',
       'text'],
      dtype='object')

In [468]:
#count the non-empty cells in each column

df.count()

Unnamed: 0    1974
title         1974
year          1974
runtime       1814
rating        1974
votes         1974
genre         1970
text          1941
dtype: int64

In [469]:
df2 = df.dropna()
df2.count()

Unnamed: 0    1792
title         1792
year          1792
runtime       1792
rating        1792
votes         1792
genre         1792
text          1792
dtype: int64
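Calling `dropna()` with no arguments removes every row that has a missing value in any column, which is why the counts fall from 1974 to 1792. When only some columns matter, the `subset` parameter limits the check to those columns. The sketch below uses a tiny made-up table rather than `tv_shows.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a table with gaps in different columns.
shows = pd.DataFrame({
    "title": ["A", "B", "C"],
    "runtime": ["55 mins.", np.nan, "30 mins."],
    "text": ["x", "y", np.nan],
})

missing_per_column = shows.isna().sum()           # missing cells per column
has_runtime = shows.dropna(subset=["runtime"])    # only require a runtime
complete_rows = shows.dropna()                    # require every column

print(missing_per_column)
print(len(has_runtime), len(complete_rows))
```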

In [470]:
#only import the columns ["title", "year", "rating", "votes"]

df3 = pd.read_csv('tv_shows.csv', usecols = ["title", "year", "rating", "votes"])

df3.head()

| | title | year | rating | votes |
|---|---|---|---|---|
| 0 | Game of Thrones | (2011 TV Series) | 9.5 | 748557 |
| 1 | Breaking Bad | (2008 TV Series) | 9.5 | 662459 |
| 2 | The Walking Dead | (2010 TV Series) | 8.7 | 500301 |
| 3 | The Big Bang Theory | (2007 TV Series) | 8.5 | 438226 |
| 4 | Dexter | (2006 TV Series) | 8.9 | 419031 |


In [471]:
#what is the maximum rating? 

df3['rating'].max()

9.6

In [472]:
#what is the minimum rating?

df3['rating'].min()


1.8

In [473]:
#find the avg of all of the ratings

avgNum = df3['rating'].mean()
avgNum

7.486524822695036

In [474]:
#what are the data types of the dataframe?

df3.dtypes

title      object
year       object
rating    float64
votes      object
dtype: object

In [475]:
#can you change the datatype of votes?

df3['votes'] = df3['votes'].astype('float')


ValueError: could not convert string to float: '748,557'
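The `ValueError` above comes from the thousands separator: the votes are stored as text such as `'748,557'`, which `astype('float')` cannot parse. Two possible ways around it are sketched below on a small inline sample: strip the commas before converting, or let `read_csv` handle them through its `thousands` parameter.

```python
import io

import pandas as pd

# Small inline sample in the same shape as the votes column.
csv_text = 'title,votes\nGame of Thrones,"748,557"\nCristela,"1,510"\n'

# Option 1: remove the separators, then convert.
raw = pd.read_csv(io.StringIO(csv_text))
votes = raw["votes"].str.replace(",", "", regex=False).astype("float")

# Option 2: tell read_csv which character separates thousands.
parsed = pd.read_csv(io.StringIO(csv_text), thousands=",")

print(votes.tolist())
print(parsed["votes"].tolist())
```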

In [397]:
#set the index to be the titles of the tv shows

df4 = df3.set_index('title')

df4

| title | year | rating | votes |
|---|---|---|---|
| Game of Thrones | (2011 TV Series) | 9.5 | 748557 |
| Breaking Bad | (2008 TV Series) | 9.5 | 662459 |
| The Walking Dead | (2010 TV Series) | 8.7 | 500301 |
| The Big Bang Theory | (2007 TV Series) | 8.5 | 438226 |
| Dexter | (2006 TV Series) | 8.9 | 419031 |
| ... | ... | ... | ... |
| Cristela | (2014 TV Series) | 6.3 | 1510 |
| Power Rangers in Space | (1998 TV Series) | 7.2 | 1504 |
| Reading Rainbow | (1983 TV Series) | 8.5 | 1501 |
| Martial Law | (1998 TV Series) | 7.1 | 1501 |


In [398]:
#get the year column of the new dataframe

df4['year']

title
Game of Thrones                 (2011 TV Series)
Breaking Bad                    (2008 TV Series)
The Walking Dead                (2010 TV Series)
The Big Bang Theory             (2007 TV Series)
Dexter                          (2006 TV Series)
                                      ...       
Cristela                        (2014 TV Series)
Power Rangers in Space          (1998 TV Series)
Reading Rainbow                 (1983 TV Series)
Martial Law                     (1998 TV Series)
Beast Machines: Transformers    (1999 TV Series)
Name: year, Length: 1974, dtype: object
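Once the titles are the index, `.loc` can look rows up by show name directly. The sketch below illustrates this on a three-row stand-in table built from values shown above; `reset_index()` turns the index back into an ordinary column.

```python
import pandas as pd

shows = pd.DataFrame({
    "title": ["Game of Thrones", "Breaking Bad", "Dexter"],
    "rating": [9.5, 9.5, 8.9],
}).set_index("title")

dexter_rating = shows.loc["Dexter", "rating"]              # one cell
top_two = shows.loc[["Game of Thrones", "Breaking Bad"]]   # several rows
restored = shows.reset_index()                             # title is a column again

print(dexter_rating)
print(top_two)
```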


### .json files

**pd.read_json**

JSON objects map to Python dictionaries

A JSON object has the same shape as a Python dictionary: both hold key-value pairs and can be nested. If your data is already in a Python dictionary rather than in a JSON file, you can load it into a DataFrame directly.

In [376]:
dataJson = {
    'item1':{
        "0":60,
        "1":60,
        "2":60
    },
    'item2':{
        '0':100,
        '1':100,
        '2':100
    }
}



In [399]:
dfjson = pd.DataFrame(dataJson)
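If the data arrives as JSON text rather than as a dictionary or a file, it can be parsed first with the standard `json` module, or handed to `pd.read_json` through a file-like wrapper. The sketch below assumes the same two-item structure as `dataJson`.

```python
import io
import json

import pandas as pd

json_text = '{"item1": {"0": 60, "1": 60, "2": 60}, "item2": {"0": 100, "1": 100, "2": 100}}'

# Parse the text into a dictionary, then build the dataframe from it.
from_dict = pd.DataFrame(json.loads(json_text))

# Or let pandas parse the text itself via a file-like object.
from_text = pd.read_json(io.StringIO(json_text))

print(from_dict)
print(from_text)
```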

In [400]:
#read in nationalParks.json

df = pd.read_json("nationalParks.json")
df.head()

| | area | coordinates | date_established_readable | date_established_unix | description | image | nps_link | states | title | id | visitors | world_heritage_site |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {'acres': '49,057.36', 'square_km': '198.5'} | {'latitude': 44.35, 'longitude': -68.21} | February 26, 1919 | -1604599200 | Covering most of Mount Desert Island and other... | {'url': 'acadia.jpg', 'attribution': 'PixelBay... | https://www.nps.gov/acad/index.htm | [{'id': 'state_maine', 'title': 'Maine'}] | Acadia | park_acadia | 3303393 | False |
| 1 | {'acres': '8,256.67', 'square_km': '33.4'} | {'latitude': -14.25, 'longitude': -170.68} | October 31, 1988 | 594280800 | The southernmost National Park is on three Sam... | {'url': 'american-samoa.jpg', 'attribution': '... | https://www.nps.gov/npsa/index.htm | [{'id': 'state_american-samoa', 'title': 'Amer... | American Samoa | park_american-samoa | 28892 | False |
| 2 | {'acres': '76,678.98', 'square_km': '310.3'} | {'latitude': 38.68, 'longitude': -109.57} | November 12, 1971 | 58773600 | This site features more than 2,000 natural san... | {'url': 'arches.jpg', 'attribution': 'PixelBay... | https://www.nps.gov/arch/index.htm | [{'id': 'state_utah', 'title': 'Utah'}] | Arches | park_arches | 1585718 | False |
| 3 | {'acres': '242,755.94', 'square_km': '982.4'} | {'latitude': 43.75, 'longitude': -102.5} | November 10, 1978 | 279525600 | The Badlands are a collection of buttes, pinna... | {'url': 'badlands.jpg', 'attribution': 'PixelB... | https://www.nps.gov/badl/index.htm | [{'id': 'state_south-dakota', 'title': 'South ... | Badlands | park_badlands | 996263 | False |
| 4 | {'acres': '801,163.21', 'square_km': '3,242.2'} | {'latitude': 29.25, 'longitude': -103.25} | June 12, 1944 | -806439600 | Named for the prominent bend in the Rio Grande... | {'url': 'big-bend.jpg', 'attribution': 'PixelB... | https://www.nps.gov/bibe/index.htm | [{'id': 'state_texas', 'title': 'Texas'}] | Big Bend | park_big-bend | 388290 | False |


In [401]:
df.dtypes

area                         object
coordinates                  object
date_established_readable    object
date_established_unix         int64
description                  object
image                        object
nps_link                     object
states                       object
title                        object
id                           object
visitors                     object
world_heritage_site            bool
dtype: object
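Several of the `object` columns above, such as `area` and `coordinates`, hold nested dictionaries. When the raw records are available as a list of dictionaries, `pd.json_normalize` is one way to flatten such nesting into separate columns. The records below are a hand-written excerpt with the same shape, not the full file.

```python
import pandas as pd

# Hand-written excerpt in the same shape as the park records.
records = [
    {"title": "Acadia", "area": {"acres": "49,057.36", "square_km": "198.5"}},
    {"title": "Arches", "area": {"acres": "76,678.98", "square_km": "310.3"}},
]

# Nested keys become dotted column names such as "area.acres".
flat = pd.json_normalize(records)

print(flat.columns.tolist())
print(flat)
```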

In [404]:
#only keep the columns ['title', 'visitors', 'date_established_readable', 'description', 'world_heritage_site', 'states']

df2 = df[['title', 'visitors', 'date_established_readable', 'description', 'world_heritage_site', 'states']]
df2.head()

| | title | visitors | date_established_readable | description | world_heritage_site | states |
|---|---|---|---|---|---|---|
| 0 | Acadia | 3303393 | February 26, 1919 | Covering most of Mount Desert Island and other... | False | [{'id': 'state_maine', 'title': 'Maine'}] |
| 1 | American Samoa | 28892 | October 31, 1988 | The southernmost National Park is on three Sam... | False | [{'id': 'state_american-samoa', 'title': 'Amer... |
| 2 | Arches | 1585718 | November 12, 1971 | This site features more than 2,000 natural san... | False | [{'id': 'state_utah', 'title': 'Utah'}] |
| 3 | Badlands | 996263 | November 10, 1978 | The Badlands are a collection of buttes, pinna... | False | [{'id': 'state_south-dakota', 'title': 'South ... |
| 4 | Big Bend | 388290 | June 12, 1944 | Named for the prominent bend in the Rio Grande... | False | [{'id': 'state_texas', 'title': 'Texas'}] |


In [405]:
#get the data types

df2.dtypes

title                        object
visitors                     object
date_established_readable    object
description                  object
world_heritage_site            bool
states                       object
dtype: object

In [408]:
#find the national parks that are world heritage sites

whsNum = df2[df2['world_heritage_site'] == True]
len(whsNum)


14
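Because `world_heritage_site` is already a boolean column, it can serve as the filter mask on its own, and summing it counts the `True` values. The sketch below shows both on a made-up three-row table; the park names are placeholders.

```python
import pandas as pd

# Placeholder data with a boolean column.
parks = pd.DataFrame({
    "title": ["Park A", "Park B", "Park C"],
    "world_heritage_site": [True, False, True],
})

heritage = parks[parks["world_heritage_site"]]             # rows where the flag is True
heritage_count = int(parks["world_heritage_site"].sum())   # True counts as 1

print(heritage)
print(heritage_count)
```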


#### Exercise 1: get the national parks with over a million visitors

In [423]:

overMili = df2

overMili['visitors'] = overMili['visitors'].astype('float')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  overMili['visitors'] = overMili['visitors'].astype('float')
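The warning above appears because `df2` was selected from `df`, and `overMili = df2` only creates a second name for that same selection, so pandas cannot tell whether the assignment is meant to reach the original data. One way to avoid it, sketched below with a small stand-in table, is to take an explicit `.copy()` before modifying a column. The sketch assumes the visitor counts are stored as digit-only strings.

```python
import pandas as pd

# Stand-in for the parks table; visitors assumed to be digit-only strings.
parks = pd.DataFrame({
    "title": ["Acadia", "American Samoa", "Arches"],
    "visitors": ["3303393", "28892", "1585718"],
    "world_heritage_site": [False, False, False],
})

# An explicit copy makes the selection an independent dataframe.
selection = parks[["title", "visitors"]].copy()
selection["visitors"] = selection["visitors"].astype("float")

over_million = selection[selection["visitors"] > 1_000_000]
print(over_million)
```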


In [426]:
overMili.dtypes

milivis = [overMili[overMili['visitors'] > 1000000]]
milivis

[                    title    visitors date_established_readable  \
 0                  Acadia   3303393.0         February 26, 1919   
 2                  Arches   1585718.0         November 12, 1971   
 7            Bryce Canyon   2365110.0         February 25, 1928   
 9            Capitol Reef   1064904.0         December 18, 1971   
 14        Cuyahoga Valley   2423390.0          October 11, 2000   
 15           Death Valley   1296283.0          October 31, 1994   
 20                Glacier   2946681.0              May 11, 1910   
 22           Grand Canyon   5969811.0         February 26, 1919   
 23            Grand Teton   3270076.0         February 26, 1929   
 26  Great Smoky Mountains  11312786.0             June 15, 1934   
 28              Haleakala   1263558.0            August 1, 1916   
 29       Hawaii Volcanoes   1887580.0            August 1, 1916   
 30            Hot Springs   1544300.0             March 4, 1921   
 32            Joshua Tree   2505286.0          


#### Exercise 2: Create a dictionary of the number of national parks per state


In [433]:
stateParks = {}


for row in df2['states']:
    stateName = row['title']
    
    if stateName in stateParks.keys():
        stateParks[stateName] += 1

    else:
        stateParks[stateName] = 1


TypeError: list indices must be integers or slices, not str

In [476]:
stateParks

{}
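The `TypeError` arises because each entry in the `states` column is a list of dictionaries, for example `[{'id': 'state_maine', 'title': 'Maine'}]`, so `row['title']` tries to index a list with a string. A possible fix is an inner loop over that list, sketched below on stand-in data. The third park is hypothetical and only there to show a park that spans two states.

```python
import pandas as pd

# Stand-in data; "Hypothetical Park" is invented to show a multi-state entry.
parks = pd.DataFrame({
    "title": ["Acadia", "Arches", "Hypothetical Park"],
    "states": [
        [{"id": "state_maine", "title": "Maine"}],
        [{"id": "state_utah", "title": "Utah"}],
        [{"id": "state_utah", "title": "Utah"}, {"id": "state_maine", "title": "Maine"}],
    ],
})

state_parks = {}
for states in parks["states"]:   # each cell is a list of state dictionaries
    for state in states:         # visit every state the park belongs to
        name = state["title"]
        state_parks[name] = state_parks.get(name, 0) + 1

print(state_parks)
```

Using `dict.get` with a default of 0 replaces the `if`/`else` membership check from the original attempt.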