<a href="https://colab.research.google.com/github/michalis0/DataMining_and_MachineLearning/blob/master/week2/Basic_Pandas_Load_File.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Basic Pandas operations

**Goal**: Our goal here is to learn how to load a dataset into a Pandas DataFrame. The dataset can come, for example, in CSV or JSON format. We will also see how to perform basic data manipulations and very basic data visualizations, so that you understand the nature of your data.

## 1. Loading a dataset in CSV format

First you have to import the `pandas` package.

In [1]:
import pandas as pd  # press Shift+Enter to run the cell (works on both Mac and Windows)

Now you can see how to autocomplete your code with functions that are included in `pandas`. For example, type `pd.read` and see that it suggests some functions.

Let's load a CSV file from the [data folder](https://github.com/michalis0/DataMining_and_MachineLearning/tree/master/week2/data) of the Git repository for week 2, for example the `pandas_tutorial_read.csv` file. Select the data file and then click on `Raw` to obtain the link used in the code below.

In [2]:
# let's load a file
data = pd.read_csv('https://raw.githubusercontent.com/michalis0/DataMining_and_MachineLearning/master/week2/data/pandas_tutorial_read.csv') 
data.head()

Unnamed: 0,2018-01-01 00:01:01;read;country_7;2458151261;SEO;North America
0,2018-01-01 00:03:20;read;country_7;2458151262;...
1,2018-01-01 00:04:01;read;country_7;2458151263;...
2,2018-01-01 00:04:02;read;country_7;2458151264;...
3,2018-01-01 00:05:03;read;country_8;2458151265;...
4,2018-01-01 00:05:42;read;country_6;2458151266;...


Is the above correct? Most likely not. We see that the values are separated by `;` and that all the data ended up in a single column. The default delimiter in the `pd.read_csv()` function is a comma (`,`), so we need to change it to `;`.

In [3]:
data = pd.read_csv('https://raw.githubusercontent.com/michalis0/DataMining_and_MachineLearning/master/week2/data/pandas_tutorial_read.csv', delimiter=';') 
data.head()

Unnamed: 0,2018-01-01 00:01:01,read,country_7,2458151261,SEO,North America
0,2018-01-01 00:03:20,read,country_7,2458151262,SEO,South America
1,2018-01-01 00:04:01,read,country_7,2458151263,AdWords,Africa
2,2018-01-01 00:04:02,read,country_7,2458151264,AdWords,Europe
3,2018-01-01 00:05:03,read,country_8,2458151265,Reddit,North America
4,2018-01-01 00:05:42,read,country_6,2458151266,Reddit,North America


This looks better. But something else does not look right now: our data frame is missing column/variable names, so the first data row has been used as the header. The column names can usually be found in the documentation of the dataset, and we can add them with the `names` parameter as shown below.

In [4]:
data = pd.read_csv('https://raw.githubusercontent.com/michalis0/DataMining_and_MachineLearning/master/week2/data/pandas_tutorial_read.csv', 
                   delimiter=';', 
                   names = ['my_datetime', 'event', 'country', 'user_id', 'source', 'topic']) 
data.head()

Unnamed: 0,my_datetime,event,country,user_id,source,topic
0,2018-01-01 00:01:01,read,country_7,2458151261,SEO,North America
1,2018-01-01 00:03:20,read,country_7,2458151262,SEO,South America
2,2018-01-01 00:04:01,read,country_7,2458151263,AdWords,Africa
3,2018-01-01 00:04:02,read,country_7,2458151264,AdWords,Europe
4,2018-01-01 00:05:03,read,country_8,2458151265,Reddit,North America
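
To see the effect of `delimiter` and `names` in isolation, here is a minimal, self-contained sketch. It does not use the course file: the `raw` string is a made-up three-line sample in the same semicolon-separated layout, and the variable names are illustrative assumptions.

```python
import io

import pandas as pd

# Hypothetical miniature of a semicolon-separated file without a header row.
raw = (
    "2018-01-01 00:01:01;read;country_7\n"
    "2018-01-01 00:03:20;read;country_7\n"
    "2018-01-01 00:04:01;read;country_8\n"
)

# Default settings: comma delimiter, first line used as the header.
default_read = pd.read_csv(io.StringIO(raw))

# Correct delimiter, but the first data row is still consumed as the header.
no_names = pd.read_csv(io.StringIO(raw), delimiter=";")

# Correct delimiter plus explicit column names: all three rows are kept.
named = pd.read_csv(
    io.StringIO(raw),
    delimiter=";",
    names=["my_datetime", "event", "country"],
)

print(default_read.shape)  # (2, 1): one column, two rows
print(no_names.shape)      # (2, 3): three columns, two rows
print(named.shape)         # (3, 3): three columns, three rows
```

Comparing the three shapes shows that a missing header silently costs one row of data unless `names` is supplied.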


### A first look at the pandas data frame

The `data.head()` method shows the first 5 rows. You can also check:

- the whole dataset: just type ```data```
- the last 5 rows with ```data.tail()``` or 
- a random sample such as ```data.sample(5)```

Try it out below:

In [5]:
data.sample(5)

Unnamed: 0,my_datetime,event,country,user_id,source,topic
1258,2018-01-01 17:00:20,read,country_4,2458152519,AdWords,Asia
1003,2018-01-01 13:32:45,read,country_8,2458152264,SEO,South America
139,2018-01-01 02:02:41,read,country_8,2458151400,Reddit,Europe
1659,2018-01-01 22:27:20,read,country_4,2458152920,Reddit,Asia
587,2018-01-01 07:57:17,read,country_7,2458151848,AdWords,Africa


## DataFrame components
There are three components of a DataFrame: 
- the index, 
- columns and 
- data (values). 

We can store each of these components into separate variables. Let's do that and then inspect them:

In [6]:
index = data.index
columns = data.columns
values = data.values

In [7]:
index

RangeIndex(start=0, stop=1795, step=1)

In [8]:
columns

Index(['my_datetime', 'event', 'country', 'user_id', 'source', 'topic'], dtype='object')

In [9]:
values

array([['2018-01-01 00:01:01', 'read', 'country_7', 2458151261, 'SEO',
        'North America'],
       ['2018-01-01 00:03:20', 'read', 'country_7', 2458151262, 'SEO',
        'South America'],
       ['2018-01-01 00:04:01', 'read', 'country_7', 2458151263,
        'AdWords', 'Africa'],
       ...,
       ['2018-01-01 23:59:36', 'read', 'country_6', 2458153053, 'Reddit',
        'Asia'],
       ['2018-01-01 23:59:36', 'read', 'country_7', 2458153054,
        'AdWords', 'Europe'],
       ['2018-01-01 23:59:38', 'read', 'country_5', 2458153055, 'Reddit',
        'Asia']], dtype=object)

## Data types of the Pandas Data Frame components

In [10]:
type(index)

pandas.core.indexes.range.RangeIndex

In [11]:
type(columns)

pandas.core.indexes.base.Index

In [12]:
type(values)

numpy.ndarray


The index and the columns are the same kind of object: a pandas **`Index`** (**`RangeIndex`** is a subclass of **`Index`**), which is a sequence of labels for either the rows or the columns.

The values are a NumPy **`ndarray`**, which stands for n-dimensional array, and is the primary container of data in the NumPy library. Pandas is built directly on top of NumPy.
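
The relationship between these types can be checked directly. The sketch below uses a small made-up frame (`toy` is an illustrative name, not part of the tutorial data) and `to_numpy()`, the method the pandas documentation suggests as an alternative to `.values`.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (made-up values).
toy = pd.DataFrame({"user_id": [1, 2, 3], "source": ["SEO", "Reddit", "SEO"]})

# RangeIndex is a subclass of Index, so both checks succeed.
print(isinstance(toy.index, pd.RangeIndex))
print(isinstance(toy.index, pd.Index))
print(isinstance(toy.columns, pd.Index))

# The values come back as a NumPy array with one row per record.
arr = toy.to_numpy()
print(isinstance(arr, np.ndarray), arr.shape)
```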

## General information about the data

Using the `info()` method you can obtain a concise summary of the data, including the data type under which each column has been stored: here `object` (i.e., string) for most columns and `int64` (integer) for `user_id`.

In [13]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1795 entries, 0 to 1794
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   my_datetime  1795 non-null   object
 1   event        1795 non-null   object
 2   country      1795 non-null   object
 3   user_id      1795 non-null   int64 
 4   source       1795 non-null   object
 5   topic        1795 non-null   object
dtypes: int64(1), object(5)
memory usage: 84.3+ KB


You can also list the data types of each column by using the `dtypes` property.

In [14]:
data.dtypes

my_datetime    object
event          object
country        object
user_id         int64
source         object
topic          object
dtype: object

### Selecting columns

If you want to select some particular columns from the data frame you can do it like this:

```data[['country', 'user_id']]``` 

It is also possible to list the columns in a different order:

```data[['user_id', 'country']]```.

The way to remember the syntax is that outer brackets signify that you want to select columns, and the inner brackets are for the list of columns itself.

Try it out.

In [15]:
data[['user_id', 'source', 'country']]

Unnamed: 0,user_id,source,country
0,2458151261,SEO,country_7
1,2458151262,SEO,country_7
2,2458151263,AdWords,country_7
3,2458151264,AdWords,country_7
4,2458151265,Reddit,country_8
...,...,...,...
1790,2458153051,AdWords,country_2
1791,2458153052,SEO,country_8
1792,2458153053,Reddit,country_6
1793,2458153054,AdWords,country_7


In [16]:
type(data[['user_id', 'source', 'country']])

pandas.core.frame.DataFrame

The above returns a pandas.DataFrame. If you want to return a pandas.Series instead then you can use this syntax:

```data.user_id ```

or 

``` data['user_id'] ```

In [17]:
data.user_id

0       2458151261
1       2458151262
2       2458151263
3       2458151264
4       2458151265
           ...    
1790    2458153051
1791    2458153052
1792    2458153053
1793    2458153054
1794    2458153055
Name: user_id, Length: 1795, dtype: int64

In [18]:
type(data.user_id)

pandas.core.series.Series
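
The difference between single and double brackets can be sketched on a tiny made-up frame (the names `toy`, `as_series` and `as_frame` are illustrative assumptions):

```python
import pandas as pd

toy = pd.DataFrame({"user_id": [10, 11], "country": ["country_7", "country_8"]})

as_series = toy["user_id"]    # a single label -> Series
as_frame = toy[["user_id"]]   # a list of labels -> DataFrame, even with one label

print(type(as_series).__name__, as_series.shape)  # Series (2,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (2, 1)
```

Note that the attribute form (`toy.user_id`) only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method, so the bracket form is the safer default.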

### Selecting rows

You can also select a few rows of your dataset with a slice of row positions. For example, below we select the first two rows, from 0 (included) to 2 (not included).

In [19]:
data[0:2]

Unnamed: 0,my_datetime,event,country,user_id,source,topic
0,2018-01-01 00:01:01,read,country_7,2458151261,SEO,North America
1,2018-01-01 00:03:20,read,country_7,2458151262,SEO,South America


### Boolean indexing

To filter users of a certain kind, for example those who came from SEO as a source, you can write:

``` data[data.source == 'SEO'] ```

where the inner statement creates a boolean mask.

In [20]:
data[data.source == 'SEO']

Unnamed: 0,my_datetime,event,country,user_id,source,topic
0,2018-01-01 00:01:01,read,country_7,2458151261,SEO,North America
1,2018-01-01 00:03:20,read,country_7,2458151262,SEO,South America
11,2018-01-01 00:08:57,read,country_7,2458151272,SEO,Australia
15,2018-01-01 00:11:22,read,country_7,2458151276,SEO,North America
16,2018-01-01 00:13:05,read,country_8,2458151277,SEO,North America
...,...,...,...,...,...,...
1772,2018-01-01 23:45:58,read,country_7,2458153033,SEO,South America
1777,2018-01-01 23:49:52,read,country_5,2458153038,SEO,North America
1779,2018-01-01 23:51:25,read,country_4,2458153040,SEO,South America
1784,2018-01-01 23:54:03,read,country_2,2458153045,SEO,North America
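
It can help to look at the boolean mask on its own before using it as a filter. This is a minimal sketch on made-up data; `toy`, `mask` and `seo_rows` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame(
    {"user_id": [1, 2, 3, 4], "source": ["SEO", "Reddit", "SEO", "AdWords"]}
)

mask = toy["source"] == "SEO"   # Series of True/False, one entry per row
print(mask.tolist())            # [True, False, True, False]
print(int(mask.sum()))          # number of matching rows: 2

seo_rows = toy[mask]            # keeps only the rows where the mask is True
print(seo_rows.index.tolist())  # original index labels are preserved: [0, 2]
```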


### Selecting both rows and columns by name (`df.loc`) or by position (`df.iloc`)

Sometimes you need to select the values for a given set of rows and columns, like below. 

The recommended way to do this is with either `df.loc` or `df.iloc`. `df.loc` is label-based: you pass it the index label of the row and the column name. `df.iloc` is integer-position-based: you pass it the row and column positions, for example row 0 (the first row) and column 0 (the first column of the data frame, here `my_datetime`). Note that a label slice such as `0:2` in `df.loc` includes its end point, whereas a position slice in `df.iloc` excludes it.

Because `df.loc` and `df.iloc` perform the whole selection in a single step, they can also be used on the left-hand side of an assignment to replace values in the original data frame. This is different from chained indexing (see the section below).

In [21]:
data.loc[0, 'topic']

'North America'

In [22]:
data.loc[0:2, ['my_datetime', 'event']]

Unnamed: 0,my_datetime,event
0,2018-01-01 00:01:01,read
1,2018-01-01 00:03:20,read
2,2018-01-01 00:04:01,read


In [23]:
data.iloc[[0,2], 0:2]

Unnamed: 0,my_datetime,event
0,2018-01-01 00:01:01,read
2,2018-01-01 00:04:01,read
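
The distinction between labels and positions is easiest to see when the index is not simply 0, 1, 2, and so on. The following sketch uses a made-up frame with the index labels 10 to 40 (all names and values are illustrative assumptions):

```python
import pandas as pd

toy = pd.DataFrame(
    {"event": ["read", "read", "read", "read"], "topic": ["A", "B", "C", "D"]},
    index=[10, 20, 30, 40],   # deliberately not 0, 1, 2, ...
)

by_label = toy.loc[10:30, "topic"]   # labels 10 through 30, end included
by_position = toy.iloc[0:2, 1]       # positions 0 and 1, end excluded

print(by_label.tolist())      # ['A', 'B', 'C']
print(by_position.tolist())   # ['A', 'B']
```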


For example, you can use the code below to change the value of an observation. The change is made in place, so it carries over to the data frame.

In [24]:
data.loc[0, 'topic']='USA'

In [25]:
data.head(1)

Unnamed: 0,my_datetime,event,country,user_id,source,topic
0,2018-01-01 00:01:01,read,country_7,2458151261,SEO,USA


### Chaining (or chained indexing)

Selecting by row and column can also be done using a combination of selection methods as follows:

``` data[['country', 'user_id']][0:1] ```

**CAUTION**: Keep in mind that each step of a chain can return a *copy* of the data rather than the original DataFrame. So if you use chaining to change data, you may find that the original DataFrame was not changed.

In [26]:
data[['country', 'user_id']][0:1]

Unnamed: 0,country,user_id
0,country_7,2458151261


If you try to replace a value in this way, it will not carry forward to the data frame.

In [27]:
data[['country', 'user_id']][0:1].user_id=1

In [28]:
data.head(1)

Unnamed: 0,my_datetime,event,country,user_id,source,topic
0,2018-01-01 00:01:01,read,country_7,2458151261,SEO,USA
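
The same effect can be reproduced on a small made-up frame, next to the single-step `.loc` assignment that does modify the original. This is a sketch under the assumption that selecting a list of columns returns a new object; the `warnings` filter is only there because pandas may emit a warning about the chained assignment.

```python
import warnings

import pandas as pd

toy = pd.DataFrame({"country": ["country_7", "country_8"], "user_id": [100, 200]})

with warnings.catch_warnings():
    # pandas may warn about the chained assignment; silenced to keep the output short.
    warnings.simplefilter("ignore")
    # The column selection creates a new object, so the write lands on that
    # temporary object and is lost.
    toy[["country", "user_id"]].loc[0, "user_id"] = 1

after_chained = toy.loc[0, "user_id"]
print(after_chained)   # still 100

# A single .loc call writes into the original frame.
toy.loc[0, "user_id"] = 1
after_loc = toy.loc[0, "user_id"]
print(after_loc)       # now 1
```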


You can find more documentation on this indexing and chained indexing [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html).

---

Now it's your turn to solve an exercise and deepen your knowledge.


<div class="alert alert-block alert-success">
    <h2>Exercise 1:</h2>


    
>Select the user_id, the country and the topic columns for the users who are from country_2, and show only the first 10 rows
</div>

In [29]:
# enter your solution here.

---
## 2. Loading JSON files

Much of the data on the Internet is available in JSON format, a structured text format that closely resembles a Python dictionary.

We will see how to load a JSON dataset in a Pandas DataFrame.

We will use the Citibike API, which provides a real-time view of the Citibike stations in New York.
The API endpoint is http://www.citibikenyc.com/stations/json.

In [30]:
import requests
url = 'http://www.citibikenyc.com/stations/json'
data = requests.get(url).json()
data

{'executionTime': '2016-01-22 04:32:49 PM',
 'stationBeanList': [{'altitude': '',
   'availableBikes': 7,
   'availableDocks': 32,
   'city': '',
   'id': 72,
   'landMark': '',
   'lastCommunicationTime': '2016-01-22 04:30:15 PM',
   'latitude': 40.76727216,
   'location': '',
   'longitude': -73.99392888,
   'postalCode': '',
   'stAddress1': 'W 52 St & 11 Ave',
   'stAddress2': '',
   'stationName': 'W 52 St & 11 Ave',
   'statusKey': 1,
   'statusValue': 'In Service',
   'testStation': False,
   'totalDocks': 39},
  {'altitude': '',
   'availableBikes': 33,
   'availableDocks': 0,
   'city': '',
   'id': 79,
   'landMark': '',
   'lastCommunicationTime': '2016-01-22 04:32:41 PM',
   'latitude': 40.71911552,
   'location': '',
   'longitude': -74.00666661,
   'postalCode': '',
   'stAddress1': 'Franklin St & W Broadway',
   'stAddress2': '',
   'stationName': 'Franklin St & W Broadway',
   'statusKey': 1,
   'statusValue': 'In Service',
   'testStation': False,
   'totalDocks': 33},

In [31]:
type(data)

dict

Above you can see what the JSON data looks like. The result contains two keys: `executionTime` and `stationBeanList`. The `stationBeanList` is a list of dictionaries, with each dictionary corresponding to one Citibike station.

In [32]:
data.keys()

dict_keys(['executionTime', 'stationBeanList'])

With Pandas we can easily convert a list of dictionaries into a DataFrame.

In [33]:
import pandas as pd  # already imported above; repeated so this cell runs on its own
df = pd.DataFrame(data["stationBeanList"])
df.head(5)

Unnamed: 0,id,stationName,availableDocks,totalDocks,latitude,longitude,statusValue,statusKey,availableBikes,stAddress1,stAddress2,city,postalCode,location,altitude,testStation,lastCommunicationTime,landMark
0,72,W 52 St & 11 Ave,32,39,40.767272,-73.993929,In Service,1,7,W 52 St & 11 Ave,,,,,,False,2016-01-22 04:30:15 PM,
1,79,Franklin St & W Broadway,0,33,40.719116,-74.006667,In Service,1,33,Franklin St & W Broadway,,,,,,False,2016-01-22 04:32:41 PM,
2,82,St James Pl & Pearl St,27,27,40.711174,-74.000165,In Service,1,0,St James Pl & Pearl St,,,,,,False,2016-01-22 04:29:41 PM,
3,83,Atlantic Ave & Fort Greene Pl,21,62,40.683826,-73.976323,In Service,1,40,Atlantic Ave & Fort Greene Pl,,,,,,False,2016-01-22 04:32:33 PM,
4,116,W 17 St & 8 Ave,19,39,40.741776,-74.001497,In Service,1,19,W 17 St & 8 Ave,,,,,,False,2016-01-22 04:32:32 PM,
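
The conversion does not depend on the live API. Here is a minimal offline sketch that parses a JSON string with the same two top-level keys; the station records in `payload` are made up for illustration and contain only a few of the real fields.

```python
import json

import pandas as pd

# Hypothetical payload shaped like the response above (values are made up).
payload = """
{
  "executionTime": "2016-01-22 04:32:49 PM",
  "stationBeanList": [
    {"id": 1, "stationName": "Station A", "availableBikes": 3, "totalDocks": 10},
    {"id": 2, "stationName": "Station B", "availableBikes": 0, "totalDocks": 20}
  ]
}
"""

parsed = json.loads(payload)                         # dict with two keys
stations = pd.DataFrame(parsed["stationBeanList"])   # one row per dictionary

print(sorted(parsed.keys()))
print(stations.shape)
print(stations.columns.tolist())
```

Each dictionary key becomes a column and each dictionary becomes a row, which is why a list of flat records maps so naturally onto a DataFrame.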


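Let's try to understand the columns.

We notice that:

- **totalDocks** ≈ **availableBikes** (bikes ready to rent) + **availableDocks** (how many docks are free). The sum matches exactly in rows 0 to 2 above, but in rows 3 and 4 it falls one short of **totalDocks** (for example 40 + 21 = 61 versus 62), so some docks are apparently neither free nor holding an available bike.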
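
A relationship like this is worth checking rather than assuming. The sketch below re-types the five rows displayed by `df.head(5)` above (three columns only) and computes the difference; the column name `unaccounted` is a hypothetical helper, not a field of the API.

```python
import pandas as pd

# The five rows displayed by df.head(5) above, restricted to three columns.
sample = pd.DataFrame(
    {
        "availableDocks": [32, 0, 27, 21, 19],
        "totalDocks": [39, 33, 27, 62, 39],
        "availableBikes": [7, 33, 0, 40, 19],
    }
)

# Docks that are neither free nor holding an available bike.
sample["unaccounted"] = (
    sample["totalDocks"] - sample["availableBikes"] - sample["availableDocks"]
)

exact_rows = int((sample["unaccounted"] == 0).sum())
print(sample["unaccounted"].tolist())   # [0, 0, 0, 1, 1]
print(exact_rows, "of", len(sample), "rows add up exactly")
```

The same two lines can be applied to the full `df` to see how often the sum matches across all stations.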

To see whether the data has been imported correctly, we can check the data types of the columns. Pandas tries to infer the data types, and in this case it does a pretty good job. In general, however, you should consider specifying the data type of each column explicitly.

In [34]:
df.dtypes

id                         int64
stationName               object
availableDocks             int64
totalDocks                 int64
latitude                 float64
longitude                float64
statusValue               object
statusKey                  int64
availableBikes             int64
stAddress1                object
stAddress2                object
city                      object
postalCode                object
location                  object
altitude                  object
testStation                 bool
lastCommunicationTime     object
landMark                  object
dtype: object
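
As a sketch of what specifying data types explicitly can look like, the example below converts columns of a made-up frame in which every value arrived as text. The frame `raw` and its values are illustrative assumptions; `astype` accepts a dictionary that maps column names to target types.

```python
import pandas as pd

# Made-up records where every value arrives as text.
raw = pd.DataFrame({"id": ["72", "79"], "latitude": ["40.767", "40.719"]})

# Convert each column to the type we actually expect.
typed = raw.astype({"id": "int64", "latitude": "float64"})

print(typed.dtypes)
```

When reading from a file, `pd.read_csv` accepts a similar dictionary through its `dtype` parameter, so the types can be fixed at load time.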

One column that does not appear to have been parsed correctly is **lastCommunicationTime**, which is stored as `object` (i.e., a string), so you may want to convert it to the `datetime` type.

<div class="alert alert-block alert-success">
    <h2>Exercise 2:</h2>


    
>Convert the **lastCommunicationTime** into a `datetime` datatype. <br>
**Hint**: Use the [pandas.to_datetime](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html) function.
</div>

In [35]:
df.lastCommunicationTime.head(2)

0    2016-01-22 04:30:15 PM
1    2016-01-22 04:32:41 PM
Name: lastCommunicationTime, dtype: object

In [36]:
# Your solution here


Let's confirm that the **lastCommunicationTime** column is of type `datetime`.
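
One possible way to run such a check is sketched below on a standalone Series, so that it does not depend on the exercise cell. The two timestamps only mimic the layout of the column, and the explicit `format` string is an assumption about that layout.

```python
import pandas as pd

# Two timestamps in the same layout as lastCommunicationTime.
stamps = pd.Series(["2016-01-22 04:30:15 PM", "2016-01-22 11:05:00 AM"])

# %I is the 12-hour clock and %p the AM/PM marker.
converted = pd.to_datetime(stamps, format="%Y-%m-%d %I:%M:%S %p")

is_datetime = pd.api.types.is_datetime64_any_dtype(converted)
print(converted.dtype)
print(is_datetime)
print(converted.dt.hour.tolist())   # [16, 11]
```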

### Adding a column

We can add a column `perc_full` that shows how full each station is, expressed as a fraction between 0 and 1.

In [37]:
df["perc_full"] = df['availableBikes']/df['totalDocks']
df.head(2)

Unnamed: 0,id,stationName,availableDocks,totalDocks,latitude,longitude,statusValue,statusKey,availableBikes,stAddress1,stAddress2,city,postalCode,location,altitude,testStation,lastCommunicationTime,landMark,perc_full
0,72,W 52 St & 11 Ave,32,39,40.767272,-73.993929,In Service,1,7,W 52 St & 11 Ave,,,,,,False,2016-01-22 04:30:15 PM,,0.179487
1,79,Franklin St & W Broadway,0,33,40.719116,-74.006667,In Service,1,33,Franklin St & W Broadway,,,,,,False,2016-01-22 04:32:41 PM,,1.0


## Summary Statistics

You can also use the `describe` method of Pandas to get a general idea of the central tendency and the spread of each numeric column.

In [38]:
df.describe()

Unnamed: 0,id,availableDocks,totalDocks,latitude,longitude,statusKey,availableBikes,perc_full
count,509.0,509.0,509.0,509.0,509.0,509.0,509.0,506.0
mean,1438.563851,21.502947,32.779961,40.728369,-73.983516,1.015717,10.534381,0.319718
std,1334.345344,12.749035,11.3979,0.026755,0.028253,0.176773,10.171236,0.277337
min,72.0,0.0,0.0,40.678907,-74.096937,1.0,0.0,0.0
25%,346.0,12.0,24.0,40.708771,-73.997044,1.0,2.0,0.08027
50%,488.0,20.0,31.0,40.725603,-73.982614,1.0,7.0,0.255656
75%,3102.0,28.0,39.0,40.749156,-73.962644,1.0,16.0,0.510993
max,3244.0,62.0,67.0,40.787209,-73.929891,3.0,46.0,1.0


The question for next week's lab will be: **Are the values in the summary statistics what you expected them to be?**

---
## Writing the data to a CSV file

With the above, we just scratched the surface of what it means to do data processing.

Once you have finished your basic data processing, you may want to save the DataFrame to a new CSV file, so that you don't have to repeat the same pre-processing every time. You can use the [to_csv](https://datatofish.com/export-dataframe-to-csv/) method.

**Note**: When you use Google Colab, this file will only be saved in your temporary virtual machine space and will be deleted once your Colab instance is closed (i.e. you close the window). To access the file, click on the Files icon in the sidebar on the left-hand side of the web interface.

If you want to explore more permanent solutions of saving your file, see [here](https://colab.research.google.com/notebooks/io.ipynb).


In [39]:
# uncomment the following to save the file
# df.to_csv("my_new_file.csv", sep=',', index=False)
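
To convince yourself that nothing is lost when saving, you can write a frame and read it back. This is a minimal sketch on made-up data that writes into a temporary directory (so it leaves no file behind); `toy`, `path` and `restored` are illustrative names.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"id": [72, 79], "perc_full": [0.25, 1.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_new_file.csv")
    toy.to_csv(path, sep=",", index=False)   # index=False omits the row labels
    restored = pd.read_csv(path)

print(restored.equals(toy))
```

Without `index=False`, the row labels would be written as an extra first column and would come back as an `Unnamed: 0` column on reading.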