# Week 5 Day 1 - Pandas

[Pandas](https://pandas.pydata.org) is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license.

In [292]:
import pandas as pd

In [293]:
# make a dictionary with lists as values
mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

In [294]:
mydataset

{'cars': ['BMW', 'Volvo', 'Ford'], 'passings': [3, 7, 2]}

In [295]:
#make it a dataframe
mycars = pd.DataFrame(mydataset)
mycars


    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


In [296]:
#get the information of your dataframe

mycars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   cars      3 non-null      object
 1   passings  3 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 180.0+ bytes


In [297]:
#get the shape of your dataframe

mycars.shape

(3, 2)

In [298]:
#get the columns

mycars.columns

Index(['cars', 'passings'], dtype='object')

In [299]:
#get the values of the rows as a NumPy array

mycars.values

array([['BMW', 3],
       ['Volvo', 7],
       ['Ford', 2]], dtype=object)

In [300]:
#get the axes (the row index and the column index)
mycars.axes


[RangeIndex(start=0, stop=3, step=1),
 Index(['cars', 'passings'], dtype='object')]

In [301]:
#get the first row
mycars.loc[0]

cars        BMW
passings      3
Name: 0, dtype: object

In [302]:
mycars.loc[2]


cars        Ford
passings       2
Name: 2, dtype: object

In [303]:
type(mycars.loc[2])

pandas.core.series.Series

In [304]:
#use a list of index labels:
mycars.loc[[0, 2]]

   cars  passings
0   BMW         3
2  Ford         2
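
The lookups above use `.loc`, which selects by index label rather than by position. A minimal sketch of the difference, assuming the same data as `mycars` (the variable names `cars`, `trimmed`, `by_label` and `by_position` are made up for this illustration):

```python
import pandas as pd

# Same data as the mycars example
cars = pd.DataFrame({
    "cars": ["BMW", "Volvo", "Ford"],
    "passings": [3, 7, 2],
})

# Remove the row labelled 0; the remaining labels are 1 and 2
trimmed = cars.drop(0)

# .loc looks up by label: label 2 is still Ford
by_label = trimmed.loc[2, "cars"]

# .iloc looks up by position: position 0 is now Volvo
by_position = trimmed.iloc[0]["cars"]

print(by_label)     # Ford
print(by_position)  # Volvo
```

As long as the index is the default 0, 1, 2, ... the two give the same rows; they only diverge once rows are dropped, filtered or reordered.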


In [305]:
#get a column


mycars['cars']

0      BMW
1    Volvo
2     Ford
Name: cars, dtype: object

In [306]:
#get the rows

mycars.loc[[0, 1, 2]]


    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


In [307]:
#what are the datatypes of the dataframe

mycars.dtypes

cars        object
passings     int64
dtype: object

In [308]:
#change 'passings' to float (astype returns a new Series)

mycars['passings'].astype('float')

0    3.0
1    7.0
2    2.0
Name: passings, dtype: float64

In [309]:
#look at it again: the dataframe itself has not changed

mycars.dtypes


cars        object
passings     int64
dtype: object

In [310]:
#assign the result back so the change is kept
mycars['passings'] = mycars['passings'].astype('float')

In [311]:
mycars.dtypes

cars         object
passings    float64
dtype: object

In [312]:
#get rid of the last row (drop returns a new dataframe)

mycars.drop(2)

    cars  passings
0    BMW       3.0
1  Volvo       7.0


In [313]:
#the original dataframe still has all three rows
mycars

    cars  passings
0    BMW       3.0
1  Volvo       7.0
2   Ford       2.0


In [314]:
#store the result in a new variable to keep it
newDf = mycars.drop(2)

In [315]:
newDf

    cars  passings
0    BMW       3.0
1  Volvo       7.0


In [316]:
#get rid of the passings column

newDf2 = newDf.drop('passings', axis = 'columns')

In [317]:
newDf2

    cars
0    BMW
1  Volvo



### NaN & empty data

In [318]:
import numpy as np

uglyData = {
  'cars': ["BMW", 'Jeep', "Ford", 'Chrysler'],
  'passings': [3, np.nan, 2, 'NaN']  # the last value is the string 'NaN', not a missing value
}

uglyDF = pd.DataFrame(uglyData)

In [319]:
uglyDF

       cars  passings
0       BMW       3.0
1      Jeep       NaN
2      Ford       2.0
3  Chrysler       NaN


In [321]:
#drop the rows that contain NaN

uglyDF_clean = uglyDF.dropna()

In [322]:
uglyDF_clean

       cars  passings
0       BMW       3.0
2      Ford       2.0
3  Chrysler       NaN


In [325]:
#replace NaN with 0

uglyDF_clean2 = uglyDF.fillna(0)
uglyDF_clean2

       cars  passings
0       BMW       3.0
1      Jeep       0.0
2      Ford       2.0
3  Chrysler       NaN
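
In both results above the Chrysler row is left alone, because its value is the string `'NaN'` rather than a real missing value. One possible way to handle that, sketched here with `pd.to_numeric` (the names `ugly` and `cleaned` are made up for this illustration):

```python
import numpy as np
import pandas as pd

# Same values as uglyData: one real NaN and one string that only looks like NaN
ugly = pd.DataFrame({
    "cars": ["BMW", "Jeep", "Ford", "Chrysler"],
    "passings": [3, np.nan, 2, "NaN"],
})

# The mixed column is stored as generic Python objects
print(ugly["passings"].dtype)            # object

# isna() flags only the real missing value
print(ugly["passings"].isna().tolist())  # [False, True, False, False]

# Convert to numbers; with errors="coerce" the string 'NaN' and any
# unparseable value end up as a real NaN
ugly["passings"] = pd.to_numeric(ugly["passings"], errors="coerce")

# Now both rows count as missing and dropna() removes them
cleaned = ugly.dropna()
print(cleaned["cars"].tolist())          # ['BMW', 'Ford']
```

After the conversion `fillna(0)` would fill both gaps as well, since the column holds real NaN values.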



### .csv Files

**pd.read_csv** 

A simple way to store large data sets is to use CSV files (comma-separated values).
CSV files contain plain text in a well-known format that almost any tool can read, including Pandas.

*pd.read_csv(filepath_or_buffer, sep=',', header='infer', index_col=None, usecols=None, engine=None, skiprows=None, nrows=None)*
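
To see what some of these parameters do without needing a file on disk, here is a small sketch that reads CSV text from memory. The rows are made up for illustration and the names `csv_text` and `shows` are assumptions, not part of the lesson's data:

```python
import io
import pandas as pd

# Hypothetical sample rows, made up for illustration only
csv_text = """title;year;rating;votes
Show A;2001;8.1;1200
Show B;2010;6.4;300
Show C;2015;9.0;4500
"""

# sep: the delimiter used in the text
# usecols: which columns to keep
# index_col: the column to use as row labels
# nrows: how many data rows to read
shows = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",
    usecols=["title", "rating"],
    index_col="title",
    nrows=2,
)

print(shows.shape)           # (2, 1)
print(shows.index.tolist())  # ['Show A', 'Show B']
```

`io.StringIO` wraps the string in a file-like object, which `read_csv` accepts in place of a path.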

In [332]:
#import tv_shows.csv

df = pd.read_csv("tv_shows.csv")

FileNotFoundError: [Errno 2] No such file or directory: 'tv_shows.csv'

In [215]:
#get a preview


In [216]:
#what are the columns of the csv file?


In [217]:
# Return the number of non-empty cells for each column


In [218]:
#only import the columns ["title", "year", "rating", "votes"]


In [219]:
#what is the maximum rating? 


In [220]:
#what is the minimum rating?


In [221]:
#find the avg of all of the ratings


In [222]:
#what are the data types of the dataframe?


In [223]:
#can you change the datatype of votes?


In [224]:
#set the index to be the names of the tv show


In [225]:
#get the year of the new dataframe
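
The empty cells above depend on `tv_shows.csv`, which failed to load in this run. As a rough sketch of the kind of calls they ask for, the following uses a few made-up rows with the same four column names; the values, and the names `csv_text`, `shows` and `by_title`, are assumptions for illustration and the real file will differ:

```python
import io
import pandas as pd

# Hypothetical rows standing in for tv_shows.csv
csv_text = """title,year,rating,votes
Show A,2001,8.0,1200
Show B,2010,6.0,
Show C,2015,7.0,4500
"""

shows = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["title", "year", "rating", "votes"],
)

print(shows.head(2))           # preview of the first two rows
print(shows.columns.tolist())  # ['title', 'year', 'rating', 'votes']
print(shows.count())           # non-empty cells per column (votes has 2)

# Basic statistics on one column
highest = shows["rating"].max()   # 8.0
lowest = shows["rating"].min()    # 6.0
average = shows["rating"].mean()  # 7.0

# votes has a gap, so pandas reads it as float64; fill the gap before casting
shows["votes"] = shows["votes"].fillna(0).astype("int64")

# Use the show titles as row labels, then look up a value by title
by_title = shows.set_index("title")
print(by_title.loc["Show C", "year"])  # 2015
```

Filling the gap with 0 is only one option; whether that is appropriate depends on what a missing vote count means in the real data.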



### .json files

**pd.read_json**

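JSON objects look like Python dictionaries

A JSON object maps closely onto a Python dictionary: its keys become dictionary keys and nested objects become nested dictionaries. If your data is already in a Python dictionary rather than in a JSON file, you can load it into a DataFrame directly with `pd.DataFrame`.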

In [226]:
dataJson = {
    'item1':{
        "0":60,
        "1":60,
        "2":60
    },
    'item2':{
        '0':100,
        '1':100,
        '2':100
    }
}
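
A short sketch of both routes, using the same values as `dataJson`: passing the dictionary straight to `pd.DataFrame`, and reading the equivalent JSON text with `pd.read_json`. The names `data_json`, `from_dict`, `json_text` and `from_text` are made up for this illustration:

```python
import io
import json
import pandas as pd

data_json = {
    "item1": {"0": 60, "1": 60, "2": 60},
    "item2": {"0": 100, "1": 100, "2": 100},
}

# A dict of dicts: outer keys become columns, inner keys become row labels
from_dict = pd.DataFrame(data_json)
print(from_dict.shape)             # (3, 2)
print(from_dict.columns.tolist())  # ['item1', 'item2']

# The same content as JSON text, read through a file-like object
json_text = json.dumps(data_json)
from_text = pd.read_json(io.StringIO(json_text))
print(from_text.shape)             # (3, 2)
```

For a JSON file on disk, `pd.read_json` also accepts a file path in place of the `io.StringIO` object.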



In [227]:
#read in nationalParks.json


In [228]:
#only get the ['date_established_readable', 'description', 'title', 'visitors', 'world_heritage_site', 'states']


In [229]:
#get the datatypes

In [230]:
#find the national parks that are world heritage sites
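
Selecting rows by a condition is usually done with a boolean mask. A minimal sketch of the pattern, assuming `world_heritage_site` holds True/False values; the rows below are made up and the real `nationalParks.json` may store the column differently:

```python
import pandas as pd

# Hypothetical rows; the column names come from the list above, the values are made up
parks = pd.DataFrame({
    "title": ["Park A", "Park B", "Park C"],
    "world_heritage_site": [True, False, True],
})

# A boolean column can be used directly as a row mask
mask = parks["world_heritage_site"]
heritage = parks[mask]
print(heritage["title"].tolist())      # ['Park A', 'Park C']

# ~ inverts the mask to select the remaining rows
not_heritage = parks[~mask]
print(not_heritage["title"].tolist())  # ['Park B']
```

Any expression that yields one True/False value per row, such as a comparison on a numeric column, can be used as the mask in the same way.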



#### Exercise 1: get the national parks with over a million visitors


#### Exercise 2: Create a dictionary of the number of national parks per state
