### Setup
Besides pandas and NumPy for data and numerical manipulation, we will also import Matplotlib and Seaborn for visualization.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

### Loading data
For most practical data analysis purposes, we won't be creating datasets ourselves. We will be getting the data from an external source, which may be an SQL database, an Excel file, a web API or a simple CSV file.

Let's load in some fake data from a CSV file.

In [4]:
# Reading CSV and setting index to 0th column
df = pd.read_csv('fakedata.csv', index_col=0)
# Showing first 5 rows
df.head()

| SL | Height | Weight | Sex |
|---|---|---|---|
| 1 | 67.01895 | 175.92944 | Male |
| 2 | 63.456494 | 156.399676 | Male |
| 3 | 71.195382 | 186.604926 | Male |
| 4 | 71.640805 | 213.741169 | Male |
| 5 | 64.766329 | 167.127461 | Male |


Here we set the index using an __int__ (the position of the column), but we could also set `index_col` to the name of the column, `'SL'`.
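As a minimal sketch of the two ways to pass `index_col`, the snippet below reads the same CSV text twice, once by column position and once by column name. The in-memory `csv_text` is a hypothetical stand-in for `fakedata.csv`, built from the first three rows shown above so that the example runs without the file.

```python
import io
import pandas as pd

# Hypothetical stand-in for fakedata.csv (first three rows only)
csv_text = """SL,Height,Weight,Sex
1,67.01895,175.92944,Male
2,63.456494,156.399676,Male
3,71.195382,186.604926,Male
"""

# Index chosen by column position
by_position = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Index chosen by column name
by_name = pd.read_csv(io.StringIO(csv_text), index_col='SL')

# Compare the two results
print(by_position.equals(by_name))
print(by_name.index.name)
```

Using the name tends to be more robust if the column order in the file ever changes.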

Reading Excel files is just as simple, and we can also specify which sheet to read. By default `sheet_name` is `0`, which selects the first sheet of the workbook (named 'Sheet1' by default in Excel).

In [6]:
df = pd.read_excel('fakedata.xlsx', index_col='SL')
df.head()

| SL | Height | Weight | Sex |
|---|---|---|---|
| 1 | 67.01895 | 175.92944 | Male |
| 2 | 63.456494 | 156.399676 | Male |
| 3 | 71.195382 | 186.604926 | Male |
| 4 | 71.640805 | 213.741169 | Male |
| 5 | 64.766329 | 167.127461 | Male |


If we want to see the __number of rows and columns__ of a DataFrame, we can access it through the `shape` attribute.

In [9]:
df.shape

(40, 3)

We can see that the DataFrame has 40 rows and 3 columns. To see the column names, use the `columns` attribute.

In [10]:
df.columns

Index(['Height', 'Weight', 'Sex'], dtype='object')
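Since `shape` is an ordinary tuple and `columns` is an Index object, both can be used directly in code. The sketch below uses a small hypothetical DataFrame named `small` (two rows copied from the output above) rather than the full dataset.

```python
import pandas as pd

# Small illustrative DataFrame; `small` is a hypothetical name
small = pd.DataFrame(
    {
        'Height': [67.01895, 63.456494],
        'Weight': [175.92944, 156.399676],
        'Sex': ['Male', 'Male'],
    },
    index=pd.Index([1, 2], name='SL'),
)

# shape is a (rows, columns) tuple, so it can be unpacked
n_rows, n_cols = small.shape

# columns is an Index; convert it to a plain list if needed
column_names = list(small.columns)

print(n_rows, n_cols)
print(column_names)
```

Note that the index (`SL`) is not counted as a column, which is why the full dataset reports 3 columns rather than 4.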

## Indexing DataFrames
#### Getting columns
We can think of a DataFrame as being like a dictionary in which each key (a column name) is mapped to a value (that column's data).
We can access one or more columns by passing in the _column name_ or _a list of column names_.

In [16]:
# The full output is 40 rows, so we only show the last 5
df[['Sex', 'Weight']].tail()

| SL | Sex | Weight |
|---|---|---|
| 36 | Female | 134.228371 |
| 37 | Female | 136.592801 |
| 38 | Female | 144.375968 |
| 39 | Female | 152.919821 |
| 40 | Female | 150.995673 |
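One detail worth illustrating is that the type of the result depends on what we pass in: a single column name gives back a Series, while a list of names gives back a DataFrame, even if the list has only one element. The sketch below assumes a small hypothetical DataFrame named `small` built from a few of the rows shown above.

```python
import pandas as pd

# Small illustrative DataFrame; `small` is a hypothetical name
small = pd.DataFrame(
    {
        'Height': [67.01895, 63.456494],
        'Weight': [175.92944, 156.399676],
        'Sex': ['Male', 'Male'],
    },
    index=pd.Index([1, 2], name='SL'),
)

one_column = small['Sex']                # single name: a Series
as_frame = small[['Sex']]                # one-element list: a DataFrame
two_columns = small[['Sex', 'Weight']]   # columns come back in the order listed

print(type(one_column).__name__)
print(type(as_frame).__name__)
print(list(two_columns.columns))
```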


#### Getting a specific cell value
__Simplistic method__: After we get a specific column, we can index that column again with the index label of the row that we want (chained indexing).

In [20]:
df['Sex'][35]

'Female'

__Preferred way__: A better way is to use the `loc` and/or `iloc` indexers.
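As a rough sketch of the difference between the two: `loc` selects by index label and column name, while `iloc` selects by integer position. The example assumes a small hypothetical DataFrame named `small` whose index labels start at 1, like the `SL` column, so labels and positions do not coincide.

```python
import pandas as pd

# Small illustrative DataFrame; `small` is a hypothetical name
small = pd.DataFrame(
    {
        'Height': [67.01895, 63.456494, 71.195382],
        'Weight': [175.92944, 156.399676, 186.604926],
        'Sex': ['Male', 'Male', 'Male'],
    },
    index=pd.Index([1, 2, 3], name='SL'),
)

# Label-based: the row labelled 2, the column named 'Weight'
by_label = small.loc[2, 'Weight']

# Position-based: second row (position 1), second column (position 1)
by_position = small.iloc[1, 1]

# Position 2 is the third row, so this is a different cell from label 2
third_row = small.iloc[2, 1]

print(by_label, by_position, third_row)
```

Both indexers take the row and the column in a single step, which avoids the intermediate column lookup of chained indexing.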