# Loading data

In [1]:
import pandas as pd

df = pd.read_csv("data/weather_data.csv")

# Showing the whole DataFrame
df

Unnamed: 0,day,temperature,windspeed,event
0,01-01-17,32,6,Rain
1,01-02-17,35,7,Sunny
2,01-03-17,28,2,Snow
3,01-04-17,24,7,Snow
4,01-05-17,32,4,Rain
5,01-06-17,31,2,Sunny
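The CSV file itself is not included with this text. As a minimal sketch, the same table can be rebuilt from the output shown above by reading it from an in-memory string. The name `csv_text` is illustrative and not part of the original notebook.

```python
import io

import pandas as pd

# The six rows displayed above, written out as CSV text (assumed file layout).
csv_text = """day,temperature,windspeed,event
01-01-17,32,6,Rain
01-02-17,35,7,Sunny
01-03-17,28,2,Snow
01-04-17,24,7,Snow
01-05-17,32,4,Rain
01-06-17,31,2,Sunny
"""

# read_csv accepts any file-like object, so a StringIO works like a file path.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```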


# Dealing with rows and columns

A DataFrame's `shape` attribute shows its dimension lengths. For 2D tabular data this means the number of __rows__ and __columns__.

In [2]:
df.shape

(6, 4)

Here we can see that our DataFrame has 6 rows and 4 columns.
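Because `shape` is a plain tuple, it can be unpacked into separate variables. The sketch below rebuilds the example table inline so it runs on its own; `n_rows` and `n_cols` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "day": ["01-01-17", "01-02-17", "01-03-17", "01-04-17", "01-05-17", "01-06-17"],
    "temperature": [32, 35, 28, 24, 32, 31],
    "windspeed": [6, 7, 2, 7, 4, 2],
    "event": ["Rain", "Sunny", "Snow", "Snow", "Rain", "Sunny"],
})

# shape is (number of rows, number of columns)
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# len() of a DataFrame gives the number of rows only
print(len(df) == n_rows)
```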

#### Getting a slice of rows

In [3]:
df[2:5]

Unnamed: 0,day,temperature,windspeed,event
2,01-03-17,28,2,Snow
3,01-04-17,24,7,Snow
4,01-05-17,32,4,Rain
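Slicing with `df[2:5]` works by position, like a Python list: the start is included and the stop is excluded. A small sketch comparing it with the explicit positional indexer `iloc`, using an inline copy of the example table:

```python
import pandas as pd

df = pd.DataFrame({
    "day": ["01-01-17", "01-02-17", "01-03-17", "01-04-17", "01-05-17", "01-06-17"],
    "temperature": [32, 35, 28, 24, 32, 31],
    "windspeed": [6, 7, 2, 7, 4, 2],
    "event": ["Rain", "Sunny", "Snow", "Snow", "Rain", "Sunny"],
})

# Rows at positions 2, 3 and 4; position 5 is not included.
sliced = df[2:5]

# iloc states the positional intent explicitly and selects the same rows.
sliced_iloc = df.iloc[2:5]

print(sliced.index.tolist())
print(sliced.equals(sliced_iloc))
```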


We can see all the column names by accessing a DataFrame's `columns` attribute:

In [4]:
df.columns

Index(['day', 'temperature', 'windspeed', 'event'], dtype='object')

#### Getting a column
To show a specific column we can access it as an attribute, `DataFrame.columnName`, or by indexing, `DataFrame['columnName']`.

In [5]:
df.day

0    01-01-17
1    01-02-17
2    01-03-17
3    01-04-17
4    01-05-17
5    01-06-17
Name: day, dtype: object

In [6]:
df['day']

0    01-01-17
1    01-02-17
2    01-03-17
3    01-04-17
4    01-05-17
5    01-06-17
Name: day, dtype: object
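The two forms are not always interchangeable. As a hedged illustration, the sketch below uses a hypothetical column named `count`, which is not in the weather data: attribute access finds the DataFrame's own `count` method first, while bracket indexing always returns the column.

```python
import pandas as pd

# Hypothetical table whose column name collides with a DataFrame method.
small = pd.DataFrame({"day": ["01-01-17", "01-02-17"], "count": [3, 5]})

# Bracket indexing always looks up the column.
by_bracket = small["count"]

# Attribute access resolves to the DataFrame.count method, not the column.
by_dot = small.count

print(type(by_bracket).__name__)
print(callable(by_dot))
```

For this reason bracket indexing is the safer default; it also works for column names that contain spaces.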

#### Getting a slice of columns
To get multiple columns we will pass a __list__ of column names, as: `DataFrame[['col1', 'col2']]`

In [7]:
df[['event', 'day', 'temperature']]

Unnamed: 0,event,day,temperature
0,Rain,01-01-17,32
1,Sunny,01-02-17,35
2,Snow,01-03-17,28
3,Snow,01-04-17,24
4,Rain,01-05-17,32
5,Sunny,01-06-17,31


__Note__: The returned DataFrame columns are ordered as we entered them, not as they were in the original DataFrame.

# Operations: 
`min`, `max`, `mean`, `std`, `describe`

In [8]:
# Showing all the data
df

Unnamed: 0,day,temperature,windspeed,event
0,01-01-17,32,6,Rain
1,01-02-17,35,7,Sunny
2,01-03-17,28,2,Snow
3,01-04-17,24,7,Snow
4,01-05-17,32,4,Rain
5,01-06-17,31,2,Sunny


##### Maximum of a column
To find the maximum value in a column we index the column by name and call its `max` method.

In [9]:
df['temperature'].max()

35

##### Minimum of a column

In [10]:
df['temperature'].min()

24

##### Average of a column

In [11]:
df['temperature'].mean()

30.333333333333332

##### Standard deviation of a column

In [12]:
df['temperature'].std()

3.8297084310253524

##### Descriptive statistics for all numerical columns
If we call a DataFrame's `describe` method, it shows us descriptive stats for all columns with numerical data.

These descriptive stats are:
- __count__: the number of datapoints (non-NaN values) in the series.
- __mean__: the arithmetic mean of the series.
- __std__: the standard deviation of the series.
- __min__: the lowest value of the series (lower limit of range).
- __25%__: 25th percentile of the series.
- __50%__: 50th percentile of the series (median).
- __75%__: 75th percentile of the series.
- __max__: the highest value of the series (upper limit of range).

In [13]:
df.describe()

Unnamed: 0,temperature,windspeed
count,6.0,6.0
mean,30.333333,4.666667
std,3.829708,2.33809
min,24.0,2.0
25%,28.75,2.5
50%,31.5,5.0
75%,32.0,6.75
max,35.0,7.0
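To make the numbers above concrete, here is a small sketch that recomputes the `temperature` statistics by hand. It assumes the pandas defaults: `std` is the sample standard deviation (dividing by `n - 1`), and percentiles use linear interpolation between neighbouring values.

```python
import math

import pandas as pd

temperature = pd.Series([32, 35, 28, 24, 32, 31], name="temperature")

# Sample standard deviation: squared deviations divided by n - 1.
mean = temperature.sum() / len(temperature)
sample_var = ((temperature - mean) ** 2).sum() / (len(temperature) - 1)
manual_std = math.sqrt(sample_var)

# The 25%, 50% and 75% rows of describe() correspond to these quantiles.
q25, q50, q75 = temperature.quantile([0.25, 0.5, 0.75])

print(manual_std, temperature.std())
print(q25, q50, q75)
```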


# Conditional selection
Below is an example of a condition we may want to check: which cells in the __temperature__ column are less than 30.

This returns a boolean series with True at the cells which satisfy our condition. We can see that it is True for values at index 2 and 3.

In [14]:
# Which values in 'temperature' column are less than 30
df['temperature'] < 30

0    False
1    False
2     True
3     True
4    False
5    False
Name: temperature, dtype: bool

If we pass that condition as an index we get only the rows which satisfy our condition. In this case, the rows at index 2 and 3.

In [15]:
# Rows where the value in the 'temperature' column is less than 30
df[df['temperature'] < 30]

Unnamed: 0,day,temperature,windspeed,event
2,01-03-17,28,2,Snow
3,01-04-17,24,7,Snow
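The boolean series can also be stored in a variable and reused. The sketch below, with the illustrative name `is_cold`, counts the matching rows and uses `~` to select the rows that do not match.

```python
import pandas as pd

df = pd.DataFrame({
    "day": ["01-01-17", "01-02-17", "01-03-17", "01-04-17", "01-05-17", "01-06-17"],
    "temperature": [32, 35, 28, 24, 32, 31],
    "windspeed": [6, 7, 2, 7, 4, 2],
    "event": ["Rain", "Sunny", "Snow", "Snow", "Rain", "Sunny"],
})

# Boolean mask: True where the temperature is below 30.
is_cold = df["temperature"] < 30

# True counts as 1, so summing the mask counts the matching rows.
n_cold = int(is_cold.sum())

cold_days = df[is_cold]    # rows where the mask is True
warm_days = df[~is_cold]   # ~ inverts the mask

print(n_cold)
print(cold_days.index.tolist())
print(warm_days.index.tolist())
```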


Showing the __rows__ of the DataFrame where the value of the __'temperature'__ column equals the maximum temperature.

In [16]:
# Row where the value of 'temperature' column is the maximum value
df[df.temperature == df.temperature.max()]

Unnamed: 0,day,temperature,windspeed,event
1,01-02-17,35,7,Sunny


#### Conditional selection returns a new DataFrame
When we conditionally slice a DataFrame like this, the return value is a new DataFrame containing the data that satisfies our condition. So we can slice and index this new DataFrame just like any other.

If we only want to access a specific cell of the row which meets our condition, we can simply index it after the condition.

Here we are only showing the ___'day'___ when the maximum temperature was recorded.

In [17]:
# 'day' of the row where the value of 'temperature' column is the maximum value
df[df.temperature == df.temperature.max()]['day']

1    01-02-17
Name: day, dtype: object

We can also show ___multiple columns___ of the DataFrame which satisfies our condition. Just like before, we pass in a __list__ of column names.

In [18]:
# 'day' and 'temperature' of the row where the value of 'temperature' column is the maximum value
df[df.temperature == df.temperature.max()][['day', 'temperature']]

Unnamed: 0,day,temperature
1,01-02-17,35


### Conditional selection of rows and columns
With `df.loc` we can ___slice out rows, columns and use conditions all at once!___

In [19]:
# Rows: 'temperature' > 30, Columns: only 'temperature' and 'event'
df.loc[df['temperature'] > 30, ['temperature','event']]

Unnamed: 0,temperature,event
0,32,Rain
1,35,Sunny
4,32,Rain
5,31,Sunny
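As a quick check, the single `loc` call selects the same data as the two-step form used earlier (condition first, then a list of columns). This is a sketch on an inline copy of the example table; `mask` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "day": ["01-01-17", "01-02-17", "01-03-17", "01-04-17", "01-05-17", "01-06-17"],
    "temperature": [32, 35, 28, 24, 32, 31],
    "windspeed": [6, 7, 2, 7, 4, 2],
    "event": ["Rain", "Sunny", "Snow", "Snow", "Rain", "Sunny"],
})

mask = df["temperature"] > 30

# Two steps: filter the rows, then pick columns from the result.
chained = df[mask][["temperature", "event"]]

# One step: rows and columns in a single loc call.
single_step = df.loc[mask, ["temperature", "event"]]

print(chained.equals(single_step))
print(single_step.index.tolist())
```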


We can also check multiple conditions by combining them with the element-wise boolean operators: and `&`, or `|`, xor `^`, etc. Each condition must be wrapped in parentheses.

In [20]:
# Rows: 'temperature' > 30 and 'event' was 'Sunny', Columns: temperature and all afterwards
df.loc[(df['temperature'] > 30) & (df['event'] == 'Sunny'), 'temperature':]

Unnamed: 0,temperature,windspeed,event
1,35,7,Sunny
5,31,2,Sunny
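A short sketch of the other operators on the same data. The parentheses matter because `&`, `|` and `^` bind more tightly than comparisons such as `>` in Python. The `isin` call at the end is one way to express several "or" conditions on a single column.

```python
import pandas as pd

df = pd.DataFrame({
    "day": ["01-01-17", "01-02-17", "01-03-17", "01-04-17", "01-05-17", "01-06-17"],
    "temperature": [32, 35, 28, 24, 32, 31],
    "windspeed": [6, 7, 2, 7, 4, 2],
    "event": ["Rain", "Sunny", "Snow", "Snow", "Rain", "Sunny"],
})

hot = df["temperature"] > 30
sunny = df["event"] == "Sunny"

hot_and_sunny = df[hot & sunny]   # both conditions hold
hot_or_sunny = df[hot | sunny]    # at least one condition holds
hot_xor_sunny = df[hot ^ sunny]   # exactly one condition holds

# Equivalent to (event == 'Snow') | (event == 'Rain')
snow_or_rain = df[df["event"].isin(["Snow", "Rain"])]

print(hot_and_sunny.index.tolist())
print(hot_or_sunny.index.tolist())
print(hot_xor_sunny.index.tolist())
print(snow_or_rain.index.tolist())
```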


# Setting index
If the index is not explicitly specified, pandas assigns the DataFrame an integer index starting from zero by default.

In [21]:
df

Unnamed: 0,day,temperature,windspeed,event
0,01-01-17,32,6,Rain
1,01-02-17,35,7,Sunny
2,01-03-17,28,2,Snow
3,01-04-17,24,7,Snow
4,01-05-17,32,4,Rain
5,01-06-17,31,2,Sunny


In [22]:
df.index

RangeIndex(start=0, stop=6, step=1)

As we can see, the index is a pandas `RangeIndex` object, similar to the regular Python `range` or NumPy `arange`.
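The similarity to `range` can be checked directly. A minimal sketch on a six-row DataFrame like the one above:

```python
import pandas as pd

df = pd.DataFrame({"temperature": [32, 35, 28, 24, 32, 31]})

# A RangeIndex exposes the same start/stop/step values as a Python range.
idx = df.index
print(idx.start, idx.stop, idx.step)

# Its labels are exactly the integers produced by range(6).
labels = list(idx)
print(labels == list(range(6)))
```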

We can set the index of a DataFrame by calling its `set_index` method.

In [23]:
df.set_index('day')

day,temperature,windspeed,event
01-01-17,32,6,Rain
01-02-17,35,7,Sunny
01-03-17,28,2,Snow
01-04-17,24,7,Snow
01-05-17,32,4,Rain
01-06-17,31,2,Sunny


However, the `set_index` method by default returns a new DataFrame with the index set as we specified. By showing the DataFrame we can see that the index has not changed.

In [24]:
df

Unnamed: 0,day,temperature,windspeed,event
0,01-01-17,32,6,Rain
1,01-02-17,35,7,Sunny
2,01-03-17,28,2,Snow
3,01-04-17,24,7,Snow
4,01-05-17,32,4,Rain
5,01-06-17,31,2,Sunny


In [25]:
df.set_index('day', inplace=True)
df

day,temperature,windspeed,event
01-01-17,32,6,Rain
01-02-17,35,7,Sunny
01-03-17,28,2,Snow
01-04-17,24,7,Snow
01-05-17,32,4,Rain
01-06-17,31,2,Sunny


Now we can see that when we display our DataFrame, the index is set to the _'day'_ column. Because we passed `inplace=True`, the DataFrame has been changed ___in place___.
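An alternative to `inplace=True` is to keep the DataFrame that `set_index` returns under a new name, which leaves the original untouched. A minimal sketch; the name `indexed` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "day": ["01-01-17", "01-02-17", "01-03-17", "01-04-17", "01-05-17", "01-06-17"],
    "temperature": [32, 35, 28, 24, 32, 31],
    "windspeed": [6, 7, 2, 7, 4, 2],
    "event": ["Rain", "Sunny", "Snow", "Snow", "Rain", "Sunny"],
})

# set_index returns a new DataFrame; df itself is left unchanged.
indexed = df.set_index("day")

print(indexed.index.name)          # the index now carries the column's name
print("day" in indexed.columns)    # 'day' is no longer a regular column
print("day" in df.columns)         # the original still has it
```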

### Indexing with custom index
With a custom index, plain bracket indexing such as `df['01-01-17']` looks for a column with that name, not a row, and raises a `KeyError`.

A more reliable way to select rows by label is the DataFrame's `loc` indexer.

In [26]:
df.loc['01-01-17']

temperature      32
windspeed         6
event          Rain
Name: 01-01-17, dtype: object
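The difference between the two lookups can be sketched as follows, on an inline copy of the table with `day` as the index. The names `indexed` and `raised` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "day": ["01-01-17", "01-02-17", "01-03-17", "01-04-17", "01-05-17", "01-06-17"],
    "temperature": [32, 35, 28, 24, 32, 31],
    "windspeed": [6, 7, 2, 7, 4, 2],
    "event": ["Rain", "Sunny", "Snow", "Snow", "Rain", "Sunny"],
})
indexed = df.set_index("day")

# Bracket indexing with a string looks for a column, so a row label fails.
try:
    indexed["01-01-17"]
    raised = False
except KeyError:
    raised = True

# loc looks the label up in the index and returns that row as a Series.
row = indexed.loc["01-01-17"]

print(raised)
print(row["temperature"], row["event"])
```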

### Slicing with custom index

In [27]:
df.loc['01-03-17':'01-06-17']

day,temperature,windspeed,event
01-03-17,28,2,Snow
01-04-17,24,7,Snow
01-05-17,32,4,Rain
01-06-17,31,2,Sunny
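Note that the row for `'01-06-17'` is part of the result: label slices with `loc` include both endpoints, unlike positional slices. A small sketch comparing the two on an inline copy of the table:

```python
import pandas as pd

df = pd.DataFrame({
    "day": ["01-01-17", "01-02-17", "01-03-17", "01-04-17", "01-05-17", "01-06-17"],
    "temperature": [32, 35, 28, 24, 32, 31],
    "windspeed": [6, 7, 2, 7, 4, 2],
    "event": ["Rain", "Sunny", "Snow", "Snow", "Rain", "Sunny"],
})
indexed = df.set_index("day")

# Label slice: both '01-03-17' and '01-06-17' are included.
by_label = indexed.loc["01-03-17":"01-06-17"]

# Positional slice: position 6 is excluded, so this covers positions 2..5.
by_position = indexed.iloc[2:6]

print(len(by_label))
print(by_label.equals(by_position))
```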


### Resetting index
If we want, we can reset the index back to the default integer index with `reset_index`. The old index is moved back into a regular column.

In [28]:
df.reset_index(inplace=True)
df

Unnamed: 0,day,temperature,windspeed,event
0,01-01-17,32,6,Rain
1,01-02-17,35,7,Sunny
2,01-03-17,28,2,Snow
3,01-04-17,24,7,Snow
4,01-05-17,32,4,Rain
5,01-06-17,31,2,Sunny
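By default `reset_index` keeps the old index as a column. If the old index is not needed, passing `drop=True` discards it instead. A minimal sketch of both, using returned copies and not `inplace=True`; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "day": ["01-01-17", "01-02-17", "01-03-17", "01-04-17", "01-05-17", "01-06-17"],
    "temperature": [32, 35, 28, 24, 32, 31],
    "windspeed": [6, 7, 2, 7, 4, 2],
    "event": ["Rain", "Sunny", "Snow", "Snow", "Rain", "Sunny"],
})
indexed = df.set_index("day")

# Default: 'day' comes back as the first column.
restored = indexed.reset_index()

# drop=True: the old index is thrown away.
dropped = indexed.reset_index(drop=True)

print(list(restored.columns))
print(list(dropped.columns))
```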
