## **Welcome to the McMaster Artificial Intelligence Society's beginner intro to using Pandas!**
> #### Take a look at the code below, and make note of some of the useful features of the Pandas library. You may find it beneficial to reference this notebook in the future when quickly looking for a particular function. Here are some additional resources:
> > __[Article](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PandasPythonForDataScience.pdf)__ <br>
> ### Happy Coding!

---
### What is Pandas?
> #### Unlike NumPy, where we work with arrays as mathematical matrices, you can think of Pandas as working with an Excel sheet. Pandas is useful when you need to filter, sort, or manipulate a dataset that is in the form of a table. As you progress into ML programming, you will come across many datasets stored in files such as CSVs. Pandas is useful for cleaning up this data before plugging it into an algorithm or a neural network (where you might then use NumPy).

---
#### Ok, let's get started! The first thing we need to do is import the pandas library: <br>

In [2]:
import pandas as pd

### Great, we'll start by creating a one-dimensional array called a Series:
> # pd.Series(d, index=[ ])
 - The parameter "d" takes in a list of values for your data points
 - The parameter "index=" takes in a list of the corresponding indices for your data

In [3]:
s = pd.Series(["John", "Martha", "Amit", "Jill"], index=[1, 2, 3, 4])
s

1      John
2    Martha
3      Amit
4      Jill
dtype: object
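
A small sketch of why the index matters: values in a Series can be looked up by their index label or by their position. The variable names below are illustrative only.

```python
import pandas as pd

s = pd.Series(["John", "Martha", "Amit", "Jill"], index=[1, 2, 3, 4])

# .loc looks a value up by its index label
second_label = s.loc[2]

# .iloc looks a value up by its position, counting from 0
first_position = s.iloc[0]

print(second_label)    # Martha
print(first_position)  # John
```

Because the index here starts at 1, label 2 and position 0 point at different entries.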

#### Great!  Let's make a Pandas dataframe. We need to first create the data we'll be using in our table. We'll do it in the form of a list: <br>

In [4]:
data = [10, 20, 30, 40]

### Ok, now let's put this list inside a pandas dataframe. We'll use the following general code:
> # pd.DataFrame(d, columns=[ ])
 - The parameter "d" takes in the data for the table, here the list we created above
 - The parameter "columns" takes in a list of the names of our columns

In [5]:
df = pd.DataFrame(data, columns=["Number"])
print(df)
df

   Number
0      10
1      20
2      30
3      40


Unnamed: 0,Number
0,10
1,20
2,30
3,40


#### Notice in the code above that we displayed the dataframe in two ways: with the print() function, which outputs plain text, and by simply stating the name of the dataframe on the last line of the cell, which the notebook renders as a formatted table.
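
To make the two display styles concrete, here is a rough sketch that builds both representations as strings. It uses `to_html()` as a stand-in for the table a notebook draws, which is an assumption about how your notebook renders dataframes.

```python
import pandas as pd

df = pd.DataFrame([10, 20, 30, 40], columns=["Number"])

# print(df) writes this plain-text form
text_view = str(df)

# A notebook shows an HTML table instead; to_html() produces similar markup
html_view = df.to_html()

print(text_view)
print(html_view[:40])
```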
---
### Now let's try to create another dataframe from a dictionary: <br>

In [6]:
newData = {'Name': ['James', 'Sarah', 'Frankie'],
           'Gender': ['Male', 'Female', 'Male'],
           'Age': [12, 56, 20]}

df = pd.DataFrame(newData)
df

Unnamed: 0,Age,Gender,Name
0,12,Male,James
1,56,Female,Sarah
2,20,Male,Frankie


#### Notice in the above code we did not have to declare the column names, as Pandas automatically uses the keys in our dictionary.
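
If you want the columns in a particular order, one option is to pass `columns=` together with the dictionary. A minimal sketch (the name `df_ordered` is illustrative):

```python
import pandas as pd

newData = {'Name': ['James', 'Sarah', 'Frankie'],
           'Gender': ['Male', 'Female', 'Male'],
           'Age': [12, 56, 20]}

# With a dictionary, columns= selects the listed keys in the given order
df_ordered = pd.DataFrame(newData, columns=['Name', 'Age', 'Gender'])

print(df_ordered.columns.tolist())  # ['Name', 'Age', 'Gender']
```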
---
### We can pull a particular column of values from our dataframe:
> # df[columnName]

### We can also pull a particular row of values from our dataframe by its index label:
> # df.loc[rowLabel]

In [7]:
print(df['Name'], '\n')

print(df.loc[1])

0      James
1      Sarah
2    Frankie
Name: Name, dtype: object 

Age           56
Gender    Female
Name       Sarah
Name: 1, dtype: object
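
Row labels and row positions only coincide when the index is 0, 1, 2, and so on. The sketch below uses a made-up index (10, 20, 30) to show the difference between `.loc` (label) and `.iloc` (position), and how a list of names selects several columns at once.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['James', 'Sarah', 'Frankie'],
                   'Age': [12, 56, 20]},
                  index=[10, 20, 30])  # custom labels, chosen for illustration

row_by_label = df.loc[20]          # row whose index label is 20
row_by_position = df.iloc[1]       # second row, counting from 0
two_columns = df[['Name', 'Age']]  # a list of names returns a dataframe

print(row_by_label['Name'], row_by_position['Name'])  # Sarah Sarah
```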


### We can add new columns to our dataframe:
> # df[name] = s
 - The parameter "name" takes a string as the title of your new column
 - The parameter "s" takes a Pandas series

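### Or new rows:
> # df = pd.concat([df, df2])
 - This will append a second dataframe to your original dataframe
 - Older versions of Pandas also offered df.append(df2) for this, but that method was removed in Pandas 2.0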

In [8]:
df['Favourite Colour'] = pd.Series(['Red', 'Red', 'Blue'], index=[0, 1, 2])
print(df)

df2 = pd.DataFrame([["Susan", 'Female', 73, 'Green'], ["Jimmy", 'Male', 7, 'Orange']], columns=["Name", 'Gender', 'Age', 'Favourite Colour'])
df = pd.concat([df, df2])  # df.append(df2) does the same in older Pandas, but was removed in Pandas 2.0
df

   Age  Gender     Name Favourite Colour
0   12    Male    James              Red
1   56  Female    Sarah              Red
2   20    Male  Frankie             Blue


Unnamed: 0,Age,Favourite Colour,Gender,Name
0,12,Red,Male,James
1,56,Red,Female,Sarah
2,20,Blue,Male,Frankie
0,73,Green,Female,Susan
1,7,Orange,Male,Jimmy
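
Notice that the combined table above repeats the index labels 0 and 1. If you would rather have a fresh 0..n-1 index, one option is `ignore_index=True`. A minimal sketch with a trimmed-down version of the two dataframes:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['James', 'Sarah', 'Frankie'], 'Age': [12, 56, 20]})
df2 = pd.DataFrame({'Name': ['Susan', 'Jimmy'], 'Age': [73, 7]})

kept = pd.concat([df, df2])                           # keeps each frame's own labels
renumbered = pd.concat([df, df2], ignore_index=True)  # builds a new index

print(kept.index.tolist())        # [0, 1, 2, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4]
```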


### Similarly, we can delete a column:
> # del df[columnName]
 - This will delete a column with the specified name

### Or a row:
> # df = df.drop(labelNumber)
 - This will delete every row with the specified index label


In [9]:
df

Unnamed: 0,Age,Favourite Colour,Gender,Name
0,12,Red,Male,James
1,56,Red,Female,Sarah
2,20,Blue,Male,Frankie
0,73,Green,Female,Susan
1,7,Orange,Male,Jimmy


In [10]:
del df['Favourite Colour']
df

Unnamed: 0,Age,Gender,Name
0,12,Male,James
1,56,Female,Sarah
2,20,Male,Frankie
0,73,Female,Susan
1,7,Male,Jimmy


In [11]:
df = df.drop(2)
df

Unnamed: 0,Age,Gender,Name
0,12,Male,James
1,56,Female,Sarah
0,73,Female,Susan
1,7,Male,Jimmy
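
Since `drop` works on labels, a repeated label removes more than one row. The sketch below rebuilds a small table with the same repeated index as above to show this, along with `drop(columns=...)` as an alternative to `del` that returns a new dataframe.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['James', 'Sarah', 'Susan', 'Jimmy'],
                   'Age': [12, 56, 73, 7]},
                  index=[0, 1, 0, 1])  # repeated labels, as in the table above

without_zero = df.drop(0)               # removes both rows labelled 0
without_age = df.drop(columns=['Age'])  # returns a copy without the column

print(without_zero['Name'].tolist())  # ['Sarah', 'Jimmy']
print(without_age.columns.tolist())   # ['Name']
```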


### Excellent! We can use the following attributes and methods to retrieve basic information from our Dataframe:
> # df.shape
 - This will provide the shape of our Dataframe (rows, columns)

> # df.index
 - This will provide our indices

> # df.columns
 - This will provide our columns

> # df.info()
 - This will provide a summary of our data in the Dataframe

> # df.count()
 - This will provide the number of non-null values in each column of the Dataframe

In [12]:
df.shape

(4, 3)

In [13]:
df.index

Int64Index([0, 1, 0, 1], dtype='int64')

In [14]:
df.columns

Index(['Age', 'Gender', 'Name'], dtype='object')

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 1
Data columns (total 3 columns):
Age       4 non-null int64
Gender    4 non-null object
Name      4 non-null object
dtypes: int64(1), object(2)
memory usage: 128.0+ bytes


In [16]:
df.count()

Age       4
Gender    4
Name      4
dtype: int64
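
The difference between `shape` and `count()` only shows up when data is missing. Here is a minimal sketch with one made-up missing age:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['James', 'Sarah', 'Susan'],
                   'Age': [12, None, 73]})  # one missing age, for illustration

counts = df.count()

print(df.shape)  # (3, 2): rows and columns, missing values included
print(counts)    # Name has 3 non-null values, Age has 2
```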

### Furthermore, we can use the following functions to derive some basic sums and values for our Dataframe:
> # df.sum()
 - Returns the sum of each column (text columns are concatenated)
 
> # df.cumsum()
 - Returns the cumulative sum of each column
 
> # df.min()/df.max()
 - Returns the minimum/maximum value of each column
 
> # df.describe()
 - Returns a summary of some basic statistics in your Dataframe

> # df.mean()
 - Returns the mean of each column
 
> # df.median() 
 - Returns the median of each column

In [17]:
df.sum()

Age                        148
Gender    MaleFemaleFemaleMale
Name      JamesSarahSusanJimmy
dtype: object

In [18]:
df.cumsum()

Unnamed: 0,Age,Gender,Name
0,12,Male,James
1,68,MaleFemale,JamesSarah
0,141,MaleFemaleFemale,JamesSarahSusan
1,148,MaleFemaleFemaleMale,JamesSarahSusanJimmy


In [19]:
df.min()

Age            7
Gender    Female
Name       James
dtype: object

In [20]:
df.max()

Age          73
Gender     Male
Name      Susan
dtype: object

In [21]:
df.describe()

Unnamed: 0,Age
count,4.0
mean,37.0
std,32.567878
min,7.0
25%,10.75
50%,34.0
75%,60.25
max,73.0


In [22]:
df.mean(numeric_only=True)  # text columns have no mean; newer Pandas raises an error without this

Age    37.0
dtype: float64

In [38]:
df.median(numeric_only=True)

Age    34.0
dtype: float64
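
Because text columns have no meaningful mean or median, one option is to compute these statistics on a single numeric column, or to pass `numeric_only=True`. A minimal sketch that rebuilds the four-row table from above:

```python
import pandas as pd

df = pd.DataFrame({'Age': [12, 56, 73, 7],
                   'Gender': ['Male', 'Female', 'Female', 'Male'],
                   'Name': ['James', 'Sarah', 'Susan', 'Jimmy']})

age_mean = df['Age'].mean()               # statistic for one column
age_median = df['Age'].median()
numeric_sums = df.sum(numeric_only=True)  # skips the text columns

print(age_mean, age_median)  # 37.0 34.0
print(numeric_sums)
```

The median of 34.0 matches the 50% row reported by df.describe().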

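#### Great! Lastly, we'll go over importing data from a csv file. Pandas reads csv files on its own through pd.read_csv, so Python's csv package is not needed for this; the import below is optional: <br>

In [41]:
import csv  # optional: pd.read_csv does not use this module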

#### We can use the following code:
> # pd.read_csv("csvName", header=None, nrows=n)
 - The "csvName" parameter takes in the name of your desired csv file
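 - The "header=" parameter takes in the row number to use for the column names. Writing "None" tells Pandas that the file has no header row, so the columns are numbered instead and the first line of the file is read as data (row 0 in the output)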
 - The "nrows=" parameter takes in the number of rows that you want to import from the csv file
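
The effect of these parameters can be sketched without a file on disk by reading from an in-memory string. The three data rows below are a made-up stand-in, not the contents of Sample.csv.

```python
import io
import pandas as pd

# A tiny in-memory stand-in for a csv file (illustrative rows only)
csv_text = (
    "city,beds,price\n"
    "SACRAMENTO,2,59222\n"
    "SACRAMENTO,3,68212\n"
    "RANCHO CORDOVA,2,94905\n"
)

# header=None: the first line is treated as data and columns are numbered
no_header = pd.read_csv(io.StringIO(csv_text), header=None, nrows=2)

# Default (header=0): the first line supplies the column names
with_header = pd.read_csv(io.StringIO(csv_text), nrows=2)

print(no_header.columns.tolist())    # [0, 1, 2]
print(with_header.columns.tolist())  # ['city', 'beds', 'price']
```

With `header=None`, the line of column names counts as one of the `nrows` rows, so only one real data row is read in this sketch.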

In [44]:
df = pd.read_csv('Sample.csv', header=None, nrows=20)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,street,city,zip,state,beds,baths,sq__ft,type,sale_date,price,latitude,longitude
1,3526 HIGH ST,SACRAMENTO,95838,CA,2,1,836,Residential,Wed May 21 00:00:00 EDT 2008,59222,38.631913,-121.434879
2,51 OMAHA CT,SACRAMENTO,95823,CA,3,1,1167,Residential,Wed May 21 00:00:00 EDT 2008,68212,38.478902,-121.431028
3,2796 BRANCH ST,SACRAMENTO,95815,CA,2,1,796,Residential,Wed May 21 00:00:00 EDT 2008,68880,38.618305,-121.443839
4,2805 JANETTE WAY,SACRAMENTO,95815,CA,2,1,852,Residential,Wed May 21 00:00:00 EDT 2008,69307,38.616835,-121.439146
5,6001 MCMAHON DR,SACRAMENTO,95824,CA,2,1,797,Residential,Wed May 21 00:00:00 EDT 2008,81900,38.51947,-121.435768
6,5828 PEPPERMILL CT,SACRAMENTO,95841,CA,3,1,1122,Condo,Wed May 21 00:00:00 EDT 2008,89921,38.662595,-121.327813
7,6048 OGDEN NASH WAY,SACRAMENTO,95842,CA,3,2,1104,Residential,Wed May 21 00:00:00 EDT 2008,90895,38.681659,-121.351705
8,2561 19TH AVE,SACRAMENTO,95820,CA,3,1,1177,Residential,Wed May 21 00:00:00 EDT 2008,91002,38.535092,-121.481367
9,11150 TRINITY RIVER DR Unit 114,RANCHO CORDOVA,95670,CA,2,2,941,Condo,Wed May 21 00:00:00 EDT 2008,94905,38.621188,-121.270555
