# <font color='red'>Tutorial 3 - Pandas Basics</font>

**Pandas** is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. The **DataFrame** is the main object in pandas. It represents data with rows and columns (tabular, spreadsheet-like data). Each column of a DataFrame is a **Series** object, and selecting a single row also returns a Series. A **Series** is very similar to a NumPy array; the major difference is that it carries an explicit index.

In [1]:
import numpy as np
import pandas as pd

## Creating a Dataframe

In most cases we will create a dataframe by importing a CSV (Comma Separated Values) file, but for now, to get familiar with Pandas, we will use a small dictionary:

In [2]:
weather_data = {
    'Day' : ['1/1/2017', '2/1/2017', '3/1/2017', '4/1/2017', '5/1/2017', '6/1/2017'],
    'Temperature' : [12, 16, -2, -3, 8, 14],
    'Windspeed' : [60, 70, 25, 75, 45, 20],
    'Event' : ['Rain', 'Sunny', 'Snow', 'Snow', 'Rain', 'Sunny']
}

In [3]:
print(type(weather_data))
print(type(weather_data['Day']))
weather_data

<class 'dict'>
<class 'list'>


{'Day': ['1/1/2017',
  '2/1/2017',
  '3/1/2017',
  '4/1/2017',
  '5/1/2017',
  '6/1/2017'],
 'Temperature': [12, 16, -2, -3, 8, 14],
 'Windspeed': [60, 70, 25, 75, 45, 20],
 'Event': ['Rain', 'Sunny', 'Snow', 'Snow', 'Rain', 'Sunny']}

Now we can use the `pd.DataFrame()` constructor to create a dataframe from the dictionary:

In [4]:
df = pd.DataFrame(weather_data)
print(type(df))
print(df)
df

<class 'pandas.core.frame.DataFrame'>
        Day  Temperature  Windspeed  Event
0  1/1/2017           12         60   Rain
1  2/1/2017           16         70  Sunny
2  3/1/2017           -2         25   Snow
3  4/1/2017           -3         75   Snow
4  5/1/2017            8         45   Rain
5  6/1/2017           14         20  Sunny


Unnamed: 0,Day,Temperature,Windspeed,Event
0,1/1/2017,12,60,Rain
1,2/1/2017,16,70,Sunny
2,3/1/2017,-2,25,Snow
3,4/1/2017,-3,75,Snow
4,5/1/2017,8,45,Rain
5,6/1/2017,14,20,Sunny
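
This section mentions that a dataframe is usually created from a CSV file. As a minimal sketch of that route, the example below reads CSV text with `pd.read_csv()`. The CSV content and the variable names `csv_text` and `csv_df` are assumptions made for illustration; with a real file you would pass its path instead of the in-memory buffer.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """Day,Temperature,Windspeed,Event
1/1/2017,12,60,Rain
2/1/2017,16,70,Sunny
3/1/2017,-2,25,Snow
"""

# read_csv accepts a file path or any file-like object
csv_df = pd.read_csv(io.StringIO(csv_text))
print(csv_df.shape)
print(csv_df)
```

The first line of the CSV becomes the column names, and pandas infers a type for each column.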


## Dataframe Series Object

Each column in a dataframe is actually a pandas Series object:

In [5]:
type(df['Temperature'])

pandas.core.series.Series

In [6]:
df['Temperature']

0    12
1    16
2    -2
3    -3
4     8
5    14
Name: Temperature, dtype: int64

A Series object contains an index and a set of values. The `index` attribute returns the index, and the `values` attribute returns the values:

In [7]:
df['Temperature'].index

RangeIndex(start=0, stop=6, step=1)

Notice that the `values` attribute is a NumPy array:

In [8]:
type(df['Temperature'].values)

numpy.ndarray

Similar to a NumPy array, a Series object has built-in methods such as `max`, `min`, `mean` and `std`:

In [9]:
df['Temperature'].max()

16

In [10]:
df['Temperature'].mean()

7.5

In [11]:
df['Temperature'].std()

8.191458966508957
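
One difference from NumPy is worth knowing here: `Series.std()` computes the sample standard deviation (`ddof=1`) by default, while `np.std()` on an array computes the population standard deviation (`ddof=0`). The sketch below rebuilds the temperature values as a standalone Series (the name `temperature` is only for this example) and compares the two.

```python
import numpy as np
import pandas as pd

temperature = pd.Series([12, 16, -2, -3, 8, 14], name='Temperature')

sample_std = temperature.std()             # pandas default: ddof=1 (divide by n - 1)
population_std = temperature.std(ddof=0)   # divide by n instead
numpy_std = np.std(temperature.values)     # NumPy default on an array: ddof=0

print(sample_std, population_std, numpy_std)
```

Passing `ddof` explicitly makes the two libraries agree.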

## Dataframes Basics

We can use the `head()` method to get the first n rows of the dataframe - the default is n=5:

In [12]:
df.head()

Unnamed: 0,Day,Temperature,Windspeed,Event
0,1/1/2017,12,60,Rain
1,2/1/2017,16,70,Sunny
2,3/1/2017,-2,25,Snow
3,4/1/2017,-3,75,Snow
4,5/1/2017,8,45,Rain


In [13]:
df.head(3)

Unnamed: 0,Day,Temperature,Windspeed,Event
0,1/1/2017,12,60,Rain
1,2/1/2017,16,70,Sunny
2,3/1/2017,-2,25,Snow


Similarly, we can use the `tail()` method to get the last n rows of the dataframe - the default is n=5: 

In [14]:
df.tail()

Unnamed: 0,Day,Temperature,Windspeed,Event
1,2/1/2017,16,70,Sunny
2,3/1/2017,-2,25,Snow
3,4/1/2017,-3,75,Snow
4,5/1/2017,8,45,Rain
5,6/1/2017,14,20,Sunny


In [15]:
df.tail(3)

Unnamed: 0,Day,Temperature,Windspeed,Event
3,4/1/2017,-3,75,Snow
4,5/1/2017,8,45,Rain
5,6/1/2017,14,20,Sunny


We can use the `shape` attribute to get the dimensionality of the dataframe:

In [16]:
df.shape

(6, 4)

We can view all the Dataframe columns by using the `columns` attribute:

In [17]:
df.columns

Index(['Day', 'Temperature', 'Windspeed', 'Event'], dtype='object')

And the types of these columns by using the `dtypes` attribute:

In [18]:
df.dtypes

Day            object
Temperature     int64
Windspeed       int64
Event          object
dtype: object

We can learn more about an 'object' column by using the `unique()` method. It returns a NumPy array holding the distinct values of the column:

In [19]:
df['Event'].unique()

array(['Rain', 'Sunny', 'Snow'], dtype=object)

The `value_counts()` method returns a Series object containing the counts of each value of an object type: 

In [20]:
df['Event'].value_counts()

Rain     2
Sunny    2
Snow     2
Name: Event, dtype: int64

## Accessing Columns

We can access a specific column by either using `df.column_name` or `df['column_name']`.

In [21]:
df.Day

0    1/1/2017
1    2/1/2017
2    3/1/2017
3    4/1/2017
4    5/1/2017
5    6/1/2017
Name: Day, dtype: object

In [22]:
df['Day']

0    1/1/2017
1    2/1/2017
2    3/1/2017
3    4/1/2017
4    5/1/2017
5    6/1/2017
Name: Day, dtype: object

In [23]:
type(df['Day'])

pandas.core.series.Series

In [24]:
df['Day'][1]

'2/1/2017'

We can access multiple columns by passing a list of column names. In this case a new Dataframe is returned:

In [25]:
df[['Day','Event']]

Unnamed: 0,Day,Event
0,1/1/2017,Rain
1,2/1/2017,Sunny
2,3/1/2017,Snow
3,4/1/2017,Snow
4,5/1/2017,Rain
5,6/1/2017,Sunny
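
The two ways of accessing a column are not fully interchangeable. Dot access only works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute or method. The sketch below uses a small dataframe with hypothetical column names (`'Wind Speed'` and `'count'`) chosen to show both cases.

```python
import pandas as pd

# Hypothetical column names chosen to show where dot access breaks down
df_demo = pd.DataFrame({
    'Wind Speed': [60, 70, 25],   # contains a space, so df_demo.Wind Speed is a syntax error
    'count': [1, 2, 3],           # same name as the DataFrame.count() method
})

print(df_demo['Wind Speed'])   # bracket access works for any column name
print(df_demo['count'])        # returns the column as a Series
print(df_demo.count)           # returns the bound method, not the column
```

Bracket access is therefore the safer default.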


## Accessing Rows

In order to access rows in a Dataframe we mainly use the loc and iloc indexers.

### Integer Location (iloc) Indexer

`iloc` allows us to access rows by integer location. When we select a single row, it returns a Series object.

In [26]:
df.iloc[0]

Day            1/1/2017
Temperature          12
Windspeed            60
Event              Rain
Name: 0, dtype: object

In [27]:
type(df.iloc[0])

pandas.core.series.Series

We can select multiple rows by passing a list of indexes:

In [28]:
df.iloc[[1, 2, 5]]

Unnamed: 0,Day,Temperature,Windspeed,Event
1,2/1/2017,16,70,Sunny
2,3/1/2017,-2,25,Snow
5,6/1/2017,14,20,Sunny


or by using slicing:

In [29]:
df.iloc[2:5]

Unnamed: 0,Day,Temperature,Windspeed,Event
2,3/1/2017,-2,25,Snow
3,4/1/2017,-3,75,Snow
4,5/1/2017,8,45,Rain


We can select columns as well by passing their integer locations as a second argument:

In [30]:
df.iloc[[1, 2, 5], [0, 3]]

Unnamed: 0,Day,Event
1,2/1/2017,Sunny
2,3/1/2017,Snow
5,6/1/2017,Sunny


### loc Indexer

With `iloc` we were searching by integer location. With `loc` we search by the label of the row and the column. In our example, the row labels are integers and the column labels are strings.

In [31]:
df.loc[0]

Day            1/1/2017
Temperature          12
Windspeed            60
Event              Rain
Name: 0, dtype: object

In [32]:
type(df.loc[0])

pandas.core.series.Series

We can also use loc on a Series to get a specific value:

In [33]:
df.loc[0].loc['Day']

'1/1/2017'

In [34]:
df.loc[0, 'Day']

'1/1/2017'

In [35]:
df.loc[[1, 2, 5], ['Day', 'Event']]

Unnamed: 0,Day,Event
1,2/1/2017,Sunny
2,3/1/2017,Snow
5,6/1/2017,Sunny


We can use slicing here too. Unlike `iloc`, a `loc` slice includes its end label:

In [36]:
df.loc[3:5, 'Temperature':'Event']

Unnamed: 0,Temperature,Windspeed,Event
3,-3,75,Snow
4,8,45,Rain
5,14,20,Sunny
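
The two indexers treat the end of a slice differently: `loc` slices by label and includes the end label, while `iloc` slices by position and excludes the end position, like a regular Python slice. A minimal sketch on a cut-down copy of the weather data (the name `df_demo` is only for this example):

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Day': ['1/1/2017', '2/1/2017', '3/1/2017', '4/1/2017', '5/1/2017', '6/1/2017'],
    'Temperature': [12, 16, -2, -3, 8, 14],
})

by_label = df_demo.loc[3:5]       # labels 3, 4 and 5 (end label included)
by_position = df_demo.iloc[3:5]   # positions 3 and 4 (end position excluded)

print(len(by_label), len(by_position))
```

With the default `RangeIndex` the labels and positions coincide, which makes this difference easy to miss.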


## Dataframe Indexes

Dataframes have a default index, which is the unnamed column on the left. In the default case it is a range of numbers which is used as an identifier for the rows.

In [37]:
df

Unnamed: 0,Day,Temperature,Windspeed,Event
0,1/1/2017,12,60,Rain
1,2/1/2017,16,70,Sunny
2,3/1/2017,-2,25,Snow
3,4/1/2017,-3,75,Snow
4,5/1/2017,8,45,Rain
5,6/1/2017,14,20,Sunny


In [38]:
df.index

RangeIndex(start=0, stop=6, step=1)

Sometimes it makes more sense to have a different identifier (index) for each row. In our example, the day column makes a better index, since it holds a unique and meaningful value for each row.

In [39]:
df.set_index('Day', inplace=True)
df

Day,Temperature,Windspeed,Event
1/1/2017,12,60,Rain
2/1/2017,16,70,Sunny
3/1/2017,-2,25,Snow
4/1/2017,-3,75,Snow
5/1/2017,8,45,Rain
6/1/2017,14,20,Sunny


In [40]:
df.loc['4/1/2017', 'Temperature']

-3

In [41]:
df.iloc[3, 0]

-3

<b>Note: The inplace parameter</b> - Many methods do not change the original dataframe, but return a new one. To apply the change to the original dataframe we have to pass the `inplace=True` parameter (the default is False).

We can reset the index back to the default by using the `reset_index()` method:

In [42]:
df.reset_index(inplace=True)
df

Unnamed: 0,Day,Temperature,Windspeed,Event
0,1/1/2017,12,60,Rain
1,2/1/2017,16,70,Sunny
2,3/1/2017,-2,25,Snow
3,4/1/2017,-3,75,Snow
4,5/1/2017,8,45,Rain
5,6/1/2017,14,20,Sunny
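
An alternative to `inplace=True` is to keep the returned dataframe under a new name, which leaves the original untouched. The sketch below shows this on a small copy of the data; the names `df_demo`, `indexed` and `restored` are assumptions for this example.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Day': ['1/1/2017', '2/1/2017', '3/1/2017'],
    'Temperature': [12, 16, -2],
})

indexed = df_demo.set_index('Day')    # returns a new DataFrame indexed by 'Day'
print(df_demo.index)                  # the original still has its RangeIndex
print(indexed.loc['2/1/2017', 'Temperature'])

restored = indexed.reset_index()      # 'Day' becomes a regular column again
print(restored)
```

Both styles give the same result; reassigning simply makes it explicit which object holds the change.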


## Filtering

We can extract a Series of True and False values, called a <b>filter mask</b>:

In [43]:
filt = df.Temperature >= 0
print(type(filt))
filt

<class 'pandas.core.series.Series'>


0     True
1     True
2    False
3    False
4     True
5     True
Name: Temperature, dtype: bool

We can use the filter mask to extract a new dataframe from the current dataframe:

In [44]:
df[filt]

Unnamed: 0,Day,Temperature,Windspeed,Event
0,1/1/2017,12,60,Rain
1,2/1/2017,16,70,Sunny
4,5/1/2017,8,45,Rain
5,6/1/2017,14,20,Sunny


We can do it in one line of code:

In [45]:
df[df.Temperature >= 0]

Unnamed: 0,Day,Temperature,Windspeed,Event
0,1/1/2017,12,60,Rain
1,2/1/2017,16,70,Sunny
4,5/1/2017,8,45,Rain
5,6/1/2017,14,20,Sunny


As we saw earlier, `loc` looks for columns and rows by labels, but we can also pass a series of booleans (the filter mask) into `loc`:

In [46]:
df.loc[filt]

Unnamed: 0,Day,Temperature,Windspeed,Event
0,1/1/2017,12,60,Rain
1,2/1/2017,16,70,Sunny
4,5/1/2017,8,45,Rain
5,6/1/2017,14,20,Sunny


Using `loc` we can also specify which columns we would like to get:

In [47]:
df.loc[filt, ['Day', 'Temperature']]

Unnamed: 0,Day,Temperature
0,1/1/2017,12
1,2/1/2017,16
4,5/1/2017,8
5,6/1/2017,14


### Using And and Or operators

We can't use Python's built-in `and` and `or` keywords for our filters. Instead we use the `&` and `|` operators, and wrap each condition in parentheses.

Let's look for days that were sunny, but also windy:

In [48]:
filt = (df['Event'] == 'Sunny') & (df['Windspeed'] > 50) 

In [49]:
df.loc[filt, 'Day']

1    2/1/2017
Name: Day, dtype: object

Now let's look for days that are rainy or snowy:

In [50]:
filt = (df['Event'] == 'Snow') | (df['Event'] == 'Rain')

In [51]:
df.loc[filt]

Unnamed: 0,Day,Temperature,Windspeed,Event
0,1/1/2017,12,60,Rain
2,3/1/2017,-2,25,Snow
3,4/1/2017,-3,75,Snow
4,5/1/2017,8,45,Rain


We can use `~` on a filter to get everything that does *not* match the filter:

In [52]:
df.loc[~filt]

Unnamed: 0,Day,Temperature,Windspeed,Event
1,2/1/2017,16,70,Sunny
5,6/1/2017,14,20,Sunny
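
To see why the keywords do not work, the sketch below tries `and` on two filter masks and catches the resulting `ValueError`. It also shows `isin()`, which is a compact alternative to chaining several `|` comparisons on the same column. The dataframe `df_demo` and the variable names are assumptions for this example.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Event': ['Rain', 'Sunny', 'Snow', 'Snow', 'Rain', 'Sunny'],
    'Windspeed': [60, 70, 25, 75, 45, 20],
})

# `and` asks a whole Series for a single truth value, which pandas refuses to guess
try:
    bad_filt = (df_demo['Event'] == 'Sunny') and (df_demo['Windspeed'] > 50)
    raised = False
except ValueError:
    raised = True
print(raised)

# & works element by element; the parentheses are needed because
# & binds more tightly than == and >
sunny_and_windy = (df_demo['Event'] == 'Sunny') & (df_demo['Windspeed'] > 50)

# isin() replaces (Event == 'Rain') | (Event == 'Snow')
rain_or_snow = df_demo['Event'].isin(['Rain', 'Snow'])
print(df_demo.loc[rain_or_snow])
```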


## Modify Data in Dataframes

We can use `loc` and `iloc` to modify data in place:

In [53]:
df

Unnamed: 0,Day,Temperature,Windspeed,Event
0,1/1/2017,12,60,Rain
1,2/1/2017,16,70,Sunny
2,3/1/2017,-2,25,Snow
3,4/1/2017,-3,75,Snow
4,5/1/2017,8,45,Rain
5,6/1/2017,14,20,Sunny


In [54]:
df.loc[df['Day'] == '4/1/2017', ['Temperature', 'Windspeed', 'Event']] = [18, 35, 'Sunny']
df

Unnamed: 0,Day,Temperature,Windspeed,Event
0,1/1/2017,12,60,Rain
1,2/1/2017,16,70,Sunny
2,3/1/2017,-2,25,Snow
3,4/1/2017,18,35,Sunny
4,5/1/2017,8,45,Rain
5,6/1/2017,14,20,Sunny
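
The same kind of assignment works with `iloc`, using integer positions instead of labels. A minimal sketch on a small copy of the data (the name `df_demo` and the new values are made up for this example):

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Day': ['1/1/2017', '2/1/2017', '3/1/2017'],
    'Temperature': [12, 16, -2],
    'Windspeed': [60, 70, 25],
})

df_demo.iloc[2, 1] = 5      # row position 2, column position 1 ('Temperature')
df_demo.iloc[0:2, 2] = 0    # row positions 0 and 1, column position 2 ('Windspeed')
print(df_demo)
```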


### The Apply Method

The Apply method is used to call a function on values in our Dataframe. Apply can work on either a Dataframe or a Series object (row or column).

When using Apply on a Series object, it can apply a function to every value in the Series object. For example we can use it to transform the entire temperature column to be presented in Fahrenheit instead of in Celsius:

In [55]:
# using an anonymous function created with `lambda`
df['Temperature'].apply(lambda x: x*1.8 + 32)

0    53.6
1    60.8
2    28.4
3    64.4
4    46.4
5    57.2
Name: Temperature, dtype: float64

We can also define a function and pass it to `apply`. The function receives each value in the `Series` object as its input:

In [56]:
def celsius_to_fahrenheit(temp_in_celsius):
    return temp_in_celsius*1.8 + 32

In [57]:
df['Temperature'].apply(celsius_to_fahrenheit)

0    53.6
1    60.8
2    28.4
3    64.4
4    46.4
5    57.2
Name: Temperature, dtype: float64
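
For simple arithmetic like this conversion, `apply` is not strictly needed: arithmetic operators on a Series already work element by element. The sketch below compares the two approaches on a standalone copy of the temperature values (the variable names are assumptions for this example).

```python
import numpy as np
import pandas as pd

temperature = pd.Series([12, 16, -2, 18, 8, 14], name='Temperature')

with_apply = temperature.apply(lambda x: x * 1.8 + 32)   # calls the function per value
vectorized = temperature * 1.8 + 32                      # operates on the whole Series

print(np.allclose(with_apply, vectorized))
```

`apply` remains useful when the transformation cannot be written as a plain arithmetic expression.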

When using `apply()` on a Dataframe object, the `axis` parameter controls whether the function receives each column (`axis='rows'`, the same as `axis=0` or `axis='index'`) or each row (`axis='columns'`, the same as `axis=1`):

In [58]:
df.apply(len, axis='rows')

Day            6
Temperature    6
Windspeed      6
Event          6
dtype: int64

In [59]:
df.apply(len, axis='columns')

0    4
1    4
2    4
3    4
4    4
5    4
dtype: int64
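
With `axis='columns'` the function receives one row at a time as a Series indexed by the column names, so it can combine values from several columns. The sketch below labels each day with a made-up rule; the function `describe_day` and its thresholds are hypothetical and only serve the example.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Temperature': [12, 16, -2],
    'Windspeed': [60, 70, 25],
})

def describe_day(row):
    # `row` is a Series indexed by the column names
    if row['Temperature'] < 0:
        return 'freezing'
    if row['Windspeed'] > 50:
        return 'windy'
    return 'calm'

# One call per row, returning one label per row
labels = df_demo.apply(describe_day, axis='columns')
print(labels)

# One call per column, returning one value per column
col_max = df_demo.apply(lambda col: col.max(), axis='index')
print(col_max)
```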

## Add and Remove Rows and Columns

### Add and Remove Columns

Let's add a temperature column in Fahrenheit to our Dataframe:

In [60]:
df['Temperature(F)'] = df['Temperature'].apply(lambda x: x*1.8 + 32)
df

Unnamed: 0,Day,Temperature,Windspeed,Event,Temperature(F)
0,1/1/2017,12,60,Rain,53.6
1,2/1/2017,16,70,Sunny,60.8
2,3/1/2017,-2,25,Snow,28.4
3,4/1/2017,18,35,Sunny,64.4
4,5/1/2017,8,45,Rain,46.4
5,6/1/2017,14,20,Sunny,57.2


And also add a humidity column:

In [61]:
df['Humidity'] = [70, 60, 50, 65, 75, 65]
df

Unnamed: 0,Day,Temperature,Windspeed,Event,Temperature(F),Humidity
0,1/1/2017,12,60,Rain,53.6,70
1,2/1/2017,16,70,Sunny,60.8,60
2,3/1/2017,-2,25,Snow,28.4,50
3,4/1/2017,18,35,Sunny,64.4,65
4,5/1/2017,8,45,Rain,46.4,75
5,6/1/2017,14,20,Sunny,57.2,65


We can also add a column using the `assign()` method:

In [62]:
df = df.assign(WindDirection=['SE', 'N', 'NW', 'SW', 'W', 'SE'])  # assign() has no inplace option; it returns a new DataFrame
df

Unnamed: 0,Day,Temperature,Windspeed,Event,Temperature(F),Humidity,WindDirection
0,1/1/2017,12,60,Rain,53.6,70,SE
1,2/1/2017,16,70,Sunny,60.8,60,N
2,3/1/2017,-2,25,Snow,28.4,50,NW
3,4/1/2017,18,35,Sunny,64.4,65,SW
4,5/1/2017,8,45,Rain,46.4,75,W
5,6/1/2017,14,20,Sunny,57.2,65,SE


We can delete columns using the `drop()` method. Let's drop the older temperature column and the windspeed and WindDirection columns:

In [63]:
df.drop(columns=['Temperature', 'Windspeed', 'WindDirection'])

Unnamed: 0,Day,Event,Temperature(F),Humidity
0,1/1/2017,Rain,53.6,70
1,2/1/2017,Sunny,60.8,60
2,3/1/2017,Snow,28.4,50
3,4/1/2017,Sunny,64.4,65
4,5/1/2017,Rain,46.4,75
5,6/1/2017,Sunny,57.2,65


Notice that we have to use the `inplace=True` parameter if we want to change the Dataframe we are working with. If we don't, the `drop()` method returns a new Dataframe object and leaves the original unchanged.

In [64]:
df.drop(columns=['Temperature', 'Windspeed', 'WindDirection'], inplace=True)
df

Unnamed: 0,Day,Event,Temperature(F),Humidity
0,1/1/2017,Rain,53.6,70
1,2/1/2017,Sunny,60.8,60
2,3/1/2017,Snow,28.4,50
3,4/1/2017,Sunny,64.4,65
4,5/1/2017,Rain,46.4,75
5,6/1/2017,Sunny,57.2,65


### Add and Remove Rows

We can use the `pandas.concat()` function to add rows to our Dataframe.

In [65]:
new_data = {
    'Day' : ['7/1/2017', '8/1/2017'],
    'Temperature(F)' : [97, 102],
    'Humidity' : [78, 85],
    'Event' : ['Sunny', 'Sunny']
}
df2 = pd.DataFrame(new_data)
df2

Unnamed: 0,Day,Temperature(F),Humidity,Event
0,7/1/2017,97,78,Sunny
1,8/1/2017,102,85,Sunny


In [66]:
pd.concat([df, df2])

Unnamed: 0,Day,Event,Temperature(F),Humidity
0,1/1/2017,Rain,53.6,70
1,2/1/2017,Sunny,60.8,60
2,3/1/2017,Snow,28.4,50
3,4/1/2017,Sunny,64.4,65
4,5/1/2017,Rain,46.4,75
5,6/1/2017,Sunny,57.2,65
0,7/1/2017,Sunny,97.0,78
1,8/1/2017,Sunny,102.0,85


We got duplicates in our index which is less preferable. Let's use the `ignore_index=True` parameter to assign the new data appropriate index numbers:

In [67]:
pd.concat([df, df2], ignore_index=True)

Unnamed: 0,Day,Event,Temperature(F),Humidity
0,1/1/2017,Rain,53.6,70
1,2/1/2017,Sunny,60.8,60
2,3/1/2017,Snow,28.4,50
3,4/1/2017,Sunny,64.4,65
4,5/1/2017,Rain,46.4,75
5,6/1/2017,Sunny,57.2,65
6,7/1/2017,Sunny,97.0,78
7,8/1/2017,Sunny,102.0,85


We can delete rows by using `drop()` and providing the index label:

In [68]:
df.drop(index=4)

Unnamed: 0,Day,Event,Temperature(F),Humidity
0,1/1/2017,Rain,53.6,70
1,2/1/2017,Sunny,60.8,60
2,3/1/2017,Snow,28.4,50
3,4/1/2017,Sunny,64.4,65
5,6/1/2017,Sunny,57.2,65


We can select specific rows to remove by using a filter. For example, let's drop the rows of days in which the humidity was 65:

In [69]:
filt = df['Humidity'] == 65
df.drop(index=df[filt].index)

Unnamed: 0,Day,Event,Temperature(F),Humidity
0,1/1/2017,Rain,53.6,70
1,2/1/2017,Sunny,60.8,60
2,3/1/2017,Snow,28.4,50
4,5/1/2017,Rain,46.4,75
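
Dropping rows that match a filter gives the same result as keeping the rows that do not match it, so the `~` operator offers a shorter alternative. A minimal sketch on a small copy of the data (the names `df_demo`, `dropped` and `kept` are assumptions for this example):

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Day': ['1/1/2017', '2/1/2017', '3/1/2017', '4/1/2017', '5/1/2017', '6/1/2017'],
    'Humidity': [70, 60, 50, 65, 75, 65],
})

filt = df_demo['Humidity'] == 65

dropped = df_demo.drop(index=df_demo[filt].index)   # remove the matching rows
kept = df_demo[~filt]                               # keep the non-matching rows

print(dropped.equals(kept))
```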


## Sort

We can use the `sort_values()` method to sort our Dataframe:

In [70]:
df.sort_values(by='Temperature(F)')

Unnamed: 0,Day,Event,Temperature(F),Humidity
2,3/1/2017,Snow,28.4,50
4,5/1/2017,Rain,46.4,75
0,1/1/2017,Rain,53.6,70
5,6/1/2017,Sunny,57.2,65
1,2/1/2017,Sunny,60.8,60
3,4/1/2017,Sunny,64.4,65


The default sort order is ascending. If we want the values sorted in descending order, we need to set `ascending=False`:

In [71]:
df.sort_values(by='Temperature(F)', ascending=False)

Unnamed: 0,Day,Event,Temperature(F),Humidity
3,4/1/2017,Sunny,64.4,65
1,2/1/2017,Sunny,60.8,60
5,6/1/2017,Sunny,57.2,65
0,1/1/2017,Rain,53.6,70
4,5/1/2017,Rain,46.4,75
2,3/1/2017,Snow,28.4,50


We can also define several columns to sort by. The rows are sorted by the first column, and ties are broken by the following columns in the order given:

In [72]:
df.sort_values(by=['Event', 'Temperature(F)'])

Unnamed: 0,Day,Event,Temperature(F),Humidity
4,5/1/2017,Rain,46.4,75
0,1/1/2017,Rain,53.6,70
2,3/1/2017,Snow,28.4,50
5,6/1/2017,Sunny,57.2,65
1,2/1/2017,Sunny,60.8,60
3,4/1/2017,Sunny,64.4,65
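
When sorting by several columns, `ascending` can also be a list with one value per column, so each column gets its own direction. The sketch below sorts events alphabetically and, within each event, temperatures from highest to lowest (the names `df_demo` and `mixed` are assumptions for this example).

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Event': ['Rain', 'Sunny', 'Snow', 'Sunny', 'Rain', 'Sunny'],
    'Temperature(F)': [53.6, 60.8, 28.4, 64.4, 46.4, 57.2],
})

# Event ascending, Temperature(F) descending within each event
mixed = df_demo.sort_values(by=['Event', 'Temperature(F)'], ascending=[True, False])
print(mixed)
```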


# <font color=blue> Exercise </font>

In [73]:
salary_data = {
    'city' : ['Tel Aviv', 'Haifa', 'Beer Sheva', 'Beer Sheva', 'Tel Aviv', 'Haifa', 'Haifa', 'Tel Aviv'],
    'age' : [27, 35, 28, 24, 32, 41, 19, 38],
    'salary' : [27000, 29000, 25000, 24000, 32000, 45000, 16500, 48000],
}

1. Create a Dataframe from the given dictionary
2. What is the mean salary?
3. Get the records of employees in the city 'Haifa' only.
4. Get the salary at index (row) 4. Given that it is a monthly salary, calculate the annual salary.
5. Get data of records in which age is higher than 30 and the salary is lower than 35000.
6. Get the salaries (not the entire record, only salary) in 'Tel Aviv' that are below 30000 or above 40000.
7. Add a new column to the Dataframe named 'salary_category' that contains the salary level of the employee: 'Low' if salary < 25000, 'Medium' if 25000 <= salary < 35000, and 'High' if salary >= 35000.
8. Add a new column to the Dataframe named 'young_talent' that is set to True if age < 30 and salary >= 25000, and to False otherwise.

In [74]:
# Write your code here


### Solution