### Pandas

[Pandas](https://pandas.pydata.org/) is a Python module for data analysis and manipulation, particularly for *tabular data* and *time series*. Pandas makes it easy to read and write data in various formats (CSV, JSON, SQL, Excel, etc.), supports manipulation operations such as merging, reshaping, and selecting, and helps with data cleaning and wrangling.

Numpy provides high-performance operations on low-level data structures. Towards this end, each numpy array holds only one type of data (this data can be complex, it doesn't have to be just a number). Keeping one type of data makes many operations faster, but it's restrictive given that we collect data from many sources with different types. Our "dictionary of numpy arrays" was a poor-man's solution to this problem.

Pandas solves this with two data structures: `Series` (a 1-dimensional labeled array) and `DataFrame` (a 2-dimensional table in which each column is labeled and can hold a different type of data; rows can be labeled as well, e.g., for temporal indexing).
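
A minimal sketch of the contrast described above, with made-up values (the column names `city` and `count` are illustrative only): a numpy array built from mixed values is coerced to a single dtype, while each `DataFrame` column keeps its own.

```python
import numpy as np
import pandas as pd

# A numpy array holds a single dtype, so mixed input is coerced (here to strings)
mixed = np.array([1, 2.5, 'a'])
print(mixed.dtype)

# A DataFrame keeps one dtype per column
df = pd.DataFrame({'city': ['istanbul', 'ankara'], 'count': [34, 6]})
print(df.dtypes)
```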

(Small Note: Not all data is tabular!)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

**Pandas Series**

This data structure is built on top of 1-dimensional numpy arrays. The most obvious addition is that each entry has a name (a label, or `Index` in pandas terminology). A `Series` can be initialized much like a numpy array.

In [None]:
# From lists
data = pd.Series([0.1,0.3,-0.1,1])
print(data)
print()
print(data[2])
print()
print(data[1:3])

We can use the `values` attribute to reach the underlying numpy array and the ``index`` attribute to reach the *labels* of the data, stored as a pandas `Index`.

In [None]:
#Underlying array
print(type(data.values))
print(data.values)
print()

In [None]:
#Labels of the data. Since we did not specify anything, it uses its default RangeIndex
print(type(data.index))
print(data.index)

In the example above, the index is a `RangeIndex` object. It is also possible to define labels explicitly and access entries through them (what data structure does this resemble?). The index type also changes with the type of the labels.

In [None]:
data = pd.Series([0.1,0.3,-0.1,1], index = ['a','b','c','d'])

print(type(data.index))
print(data.index)
print()

In [None]:
data

In [None]:
# Access with the given labels
print(data['c'])

In [None]:
# Integer (positional) indexing still works here, but be careful: recent pandas versions deprecate it in favor of data.iloc[0]
print(data[0])

In [None]:
# Slicing with the given index is possible. 
# This assumes the given order
# Important Note: The last index is included!!!!
print(data['b':'d'])
print()
print(data['d':'b':-1])
print()

In [None]:
data2 = pd.Series([0.1,0.3,-0.1,1], index = ['d','c','b','a'])
print(data2['b':'d']) 
print()
print(data2['b':'d':-1])
print()

In [None]:
# We can give our own integer labels as well
data = pd.Series([0.1,0.3,-0.1,1], index = [1,3,6,10])
print(type(data.index))
print(data.index)
print()
print(data[6])

In [None]:
# Since the given indices are integers, the below no longer works!
# print(data[0]) # uncommenting will give an error

In [None]:
# We can give our own float labels as well
data = pd.Series([0.1,0.3,-0.1,1], index = [1.,3.,6.,10.])
print(type(data.index))
print(data.index)
print()
print(data[6.])
print(data[6])
# print(data[0]) # uncommenting will give an error

In [None]:
# You can mix and match
data = pd.Series([0.1,0.3,-0.1,1,5,4,3], index = ['a','b','c','d',1,2,5])

print(type(data.index))
print(data.index)
print()

To avoid potential ambiguity, the Series data structure has the `loc` and `iloc` indexers: `loc` (location) provides access based on the given labels, and `iloc` (integer location) provides access based on integer positions (Python style).

In [None]:
data = pd.Series([0.1,0.3,-0.1,1], index = [1,3,6,10])
print(data)

In [None]:
print(data.loc[1:3]) 

In [None]:
print(data.iloc[1:3]) 

In [None]:
print(data.loc[1:]) 

In [None]:
data = pd.Series([0.1,0.3,-0.1,1], index = ['a','b','c','d'])
print(data)
print(data.loc['a':'b']) 
print(data.iloc[1:3])
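
As a compact summary of the difference (a sketch with illustrative values; `s`, `by_label` and `by_position` are not names used elsewhere in this notebook), the same slice `1:3` selects different entries depending on the indexer:

```python
import pandas as pd

s = pd.Series([0.1, 0.3, -0.1, 1.0], index=[1, 3, 6, 10])

by_label = s.loc[1:3]      # labels from 1 to 3, end label included
by_position = s.iloc[1:3]  # positions 1 and 2, end position excluded

print(list(by_label.index))
print(list(by_position.index))
```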

The ability to give custom indices makes the Series data structure resemble Python dictionaries. Unsurprisingly, we can initialize a Series directly from a dictionary.

In [None]:
plate_dict = {'istanbul':'34', 'ankara':'06', 'kocaeli':'41', 'izmir':'35'}

plates = pd.Series(plate_dict)

print(plates.index)
print()
print(plates.values)

In [None]:
print(plates['istanbul'])
print()
print(plates['istanbul':'ankara'])
print()
print(plates.loc['istanbul':'ankara'])
print()
print(plates.iloc[0:2])

You can pick a subset of the keys from the dictionary as well.

In [None]:
plates_small = pd.Series(plate_dict, index = ['istanbul', 'ankara'])
print(plates_small.index)
print(plates_small.values)

We can remove a data point from a Series using its label with the `drop` function. 

In [None]:
plates_missing = plates.drop('izmir') #inplace=True drops it from the original but returns None
print(plates.index)
print(plates_missing.index)
print(plates_missing)

Series also has the `keys()` method (equivalent to `index`)

In [None]:
print(plates.keys())

We have multiple `foreach` options for the Series data structure

In [None]:
for plate in plates:
    print(plate)

In [None]:
# vs dictionaries which return keys 
for something in plate_dict:
    print(something)

In [None]:
for city in plates.keys():
    print(city)
    
print()

for city in plates.index:
    print(city)

In [None]:
for tuples in plates.items():
    print(tuples)

In [None]:
for city,plate in plates.items():
    print(city,plate)

Let's change gears and see NaN values

In [None]:
a=pd.Series([1, 3, 5, np.nan, 6, 8])
a

There are several NaN related functions in pandas

In [None]:
a.fillna(np.mean(a)) #write over NaN values

In [None]:
# a.fillna(np.mean(a), inplace = True)
#a

In [None]:
np.isnan(a) # Check whether values are NaN
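
Pandas also ships its own NaN helpers. The sketch below uses `isna`, `dropna` and `fillna` on the same data; the variable names `missing`, `cleaned` and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 3, 5, np.nan, 6, 8])

missing = a.isna()           # boolean mask, True where the value is NaN
cleaned = a.dropna()         # new Series without the NaN entries
filled = a.fillna(a.mean())  # Series.mean skips NaN by default

print(missing.sum(), len(cleaned), filled[3])
```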

You can replace values as well (specifying what to replace can get relatively complex if you want to)

In [None]:
a.replace(12,1)

In [None]:
a.replace([6,8],12)

In [None]:
print(a)
a.replace([6,8],12, inplace=True)
print(a)

In [None]:
print(a.values<5)
print(type(a.values<5))

In [None]:
print(a<5)
print(type(a<5))

In [None]:
a.replace(a.values[a.values<5],5)

In [None]:
a.replace(np.nan,10)

We can apply arbitrary functions on each data point

In [None]:
a.apply(np.log)

In [None]:
np.log(a)

In [None]:
# Custom function
def dummy(x):
    if x < 5:
        return 5
    else:
        return x
a.apply(dummy)

In [None]:
# dummy(a) # this will give an error since all the statements inside dummy are designed for single values
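
For simple element-wise rules such as `dummy`, a vectorized method usually avoids the Python-level loop that `apply` performs. A sketch, assuming the goal is only to raise values below 5 up to 5 (the names `looped` and `vectorized` are illustrative):

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 3, 5, np.nan, 6, 8])

def dummy(x):
    # Same rule as in the cell above: values below 5 become 5
    return 5 if x < 5 else x

looped = a.apply(dummy)       # calls dummy once per element
vectorized = a.clip(lower=5)  # same rule applied to the whole array; NaN stays NaN

print(vectorized)
```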

**Pandas Data Frame**

The `DataFrame` data structure can be thought of as a 2D numpy array with indexed rows and columns, where each column can hold a different data type. It consists of `Series` data structures with the same indices (but different data).

In [None]:
vehicle_count_dict = {'istanbul': 4187531, 'ankara':2003921, 'kocaeli':393605, 'izmir':1400809}

vehicle_count = pd.Series(vehicle_count_dict)

road = pd.DataFrame({'plate':plates,'vehicle count':vehicle_count})
print(road)

It has the following properties: ``values`` (2-dim numpy array), ``index`` (pd.Index, row names) and additionally ``columns`` (pd.Index, column names; ``keys()`` is an equivalent method).

In [None]:
print(type(road.values))
print(road.values)
print(road.values.dtype) #object, why?

In [None]:
road.to_numpy()
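
The `object` dtype above is a consequence of mixing column types: a numpy array needs one dtype for all entries, so strings and integers fall back to generic Python objects. A small sketch with a subset of the same data:

```python
import pandas as pd

road = pd.DataFrame({'plate': ['34', '06'],
                     'vehicle count': [4187531, 2003921]},
                    index=['istanbul', 'ankara'])

print(road.to_numpy().dtype)                     # common dtype of a string and an integer column
print(road[['vehicle count']].to_numpy().dtype)  # a purely numeric selection keeps a numeric dtype
```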

In [None]:
print(type(road.index))
print(road.index)

In [None]:
print(type(road.columns))
print(road.columns)

In [None]:
print(type(road.keys()))
print(road.keys())

Dictionary-style access to columns:

In [None]:
print(road['plate'])
print()
print(road['vehicle count'])

In [None]:
road

In [None]:
road.iloc[:,1]

A `DataFrame` can also be created directly from a 2-dim numpy array.

In [None]:
x = np.random.rand(3, 2)
print(x)
print()

data = pd.DataFrame(x,
             columns=['abc', 'qwe'],
             index=['a', 'b', 'c'])

print(data)

In [None]:
print(data.values.dtype)

In [None]:
print(data.to_numpy().dtype)

One of the main strengths of Pandas is loading files into dataframes. Let's start with a simple `read_csv` function call and load the student_grades.csv from the matplotlib exercise. The file looks like this:

`Letter,Total,Exams,HWs`  
`D,42.50,28.18,14.32`  
`F,33.21,18.75,14.46`  
`D+,50.01,30.85,19.16`  
`F,10.28,8.78,1.50`  
`D+,47.59,32.75,14.84`  
`B,75.17,41.45,33.72`  
`B,74.82,39.23,33.79`  
`...` 

In [None]:
student_grades_df = pd.read_csv('student_grades.csv')

In [None]:
student_grades_df

Some things to notice:
* Notebook displays the first and last 5 entries by default
* The column names/labels are *inferred* 
* The row labels aka indices are integers

Let's Shift+Tab the signature! 

We will see more of this but let's continue with DataFrames

In [None]:
# The print output
print(student_grades_df)

In [None]:
# Side Note:
from IPython.display import display
display(student_grades_df)

In [None]:
# Column and row labels:
print(student_grades_df.columns)
print(student_grades_df.index)

We can look at a desired number of rows from the *head* and the *tail*

In [None]:
# First 10
student_grades_df.head(10) #if no integer is given, only 5 of them are displayed

In [None]:
# Last 7
student_grades_df.tail(7) #if no integer is given, only 5 of them are displayed

The `DataFrame` data structure provides a method called `describe` to show a few simple descriptive statistics of the data

In [None]:
student_grades_df.describe()

Where did the letter grades go?
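
By default `describe` summarizes only the numeric columns, which is why `Letter` is missing. A sketch with a few rows copied from the file (the frame name `grades` is illustrative); `include='all'` asks for every column:

```python
import pandas as pd

grades = pd.DataFrame({'Letter': ['D', 'F', 'D+', 'F'],
                       'Total': [42.50, 33.21, 50.01, 10.28]})

numeric_only = grades.describe()             # numeric columns only
everything = grades.describe(include='all')  # adds count/unique/top/freq for Letter

print(numeric_only.columns.tolist())
print(everything.loc[['count', 'unique', 'top', 'freq'], 'Letter'])
```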

Let's go over **access** with `loc` and `iloc`

In [None]:
student_grades_df

`iloc`: integer location access (supports all the Pythonic stuff!)

In [None]:
# iloc[row, col]
print(student_grades_df.iloc[0, 1])   # row 0, column 1: the Total of the first student
print(student_grades_df.iloc[1, 0])   # row 1, column 0: the Letter grade of the second student
print(student_grades_df.iloc[93, 2])  # row 93, column 2: the Exams score of the student in row 93

In [None]:
# iloc[row]
print(student_grades_df.iloc[3])     # All data for the student in row 3
print()

# iloc[:, column]
print(student_grades_df.iloc[:, 1])  # all total grades (only display first and last 5)

In [None]:
# Slicing
print(student_grades_df.iloc[10:15, 0:2])  # students 10,11,12,13,14 and their letter and total grades

In [None]:
print(student_grades_df.iloc[1:10:2, ::-1])  

`loc`: location access (i.e. access with row and column names/labels/indices)

In [None]:
student_grades_df

In [None]:
# loc[row, col] 
print(student_grades_df.loc[4, 'Letter'])
print(student_grades_df.loc[91, 'HWs'])
print(student_grades_df.loc[50, 'Exams'])

In [None]:
# loc[row]
print(student_grades_df.loc[94])

In [None]:
# loc[:, col]
print(student_grades_df.loc[:, 'Total'])

In [None]:
# slicing - note the inclusion of the end even with integers!
print(student_grades_df.loc[0:10, 'Total':'HWs'])

Column access for `DataFrame`

In [None]:
print(student_grades_df.loc[:, 'Exams'])  # 'Exams' column
print()
print(student_grades_df['Exams'])        # 'Exams' column
print()
print(student_grades_df.Exams)

Dataframe Transpose

In [None]:
student_grades_df.T

Let's look at another example with named rows (transport.csv):  
`country, car, bus, rail`  
`some more explanations, yada yada yada`  
`France, 86.1, 5.3, 8.6`  
`Germany, 85.2, 7.1, 7.7`  
`Netherlands, 86.4, 4.6, 9`  
`United Kingdom, 88.2, 6.5, 5.3` 

The second row is just comments. There are spaces after the commas. It looks like the first column can actually be used as the row index. Let's incorporate these into our file reading.

In [None]:
tran = pd.read_csv('transport.csv', skiprows=[1], skipinitialspace=True, index_col=0)
tran

In [None]:
print(tran.columns)
print(tran.index)
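
To try these arguments without the file, the text can be fed through `io.StringIO`. This is a sketch: the string reproduces the rows listed above, and `tran_demo` is an illustrative name.

```python
import io
import pandas as pd

raw = """country, car, bus, rail
some more explanations, yada yada yada
France, 86.1, 5.3, 8.6
Germany, 85.2, 7.1, 7.7
Netherlands, 86.4, 4.6, 9
United Kingdom, 88.2, 6.5, 5.3
"""

tran_demo = pd.read_csv(io.StringIO(raw),
                        skiprows=[1],           # drop the comment line
                        skipinitialspace=True,  # ignore the space after each comma
                        index_col=0)            # use the country names as row labels
print(tran_demo)
```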

In [None]:
print(tran.loc['France', 'bus'])
print(tran.loc['Germany', 'car'])
print(tran.loc['Netherlands', 'rail'])

In [None]:
print(tran.loc['France'])
print()
print(tran.loc['Germany'])
print()
print(tran.loc['Netherlands'])

In [None]:
tran.loc['Netherlands']

In [None]:
print(tran.loc[:, 'car'])  
print()
print(tran['car'])
print()
print(tran.car)

`DataFrame` objects have their own functions similar to numpy arrays. In addition, `DataFrame` objects can be treated as numpy arrays if their columns have appropriate dtypes. As a result, we can call numpy functions on dataframes directly.


Let's see a few examples

In [None]:
print('The most car usage among given countries:', tran['car'].max())
print('Country with most car usage:', tran['car'].idxmax()) # idxmax is like argmax but returns the label
print('Average car usage among given countries:', tran['car'].mean())

In [None]:
print('Average total grades:', student_grades_df['Total'].mean())
print('Median total grades:', student_grades_df['Total'].median())
print('Standard deviation of total grades:', student_grades_df['Total'].std())
print()
# mode() returns a Series since several items can share the highest count
print('Most frequent grade:', student_grades_df['Letter'].mode()) 

In [None]:
print(student_grades_df.mode())

In [None]:
print('Most frequent grade:', student_grades_df['Letter'].mode()) 

Numpy functions

In [None]:
# np.log(student_grades_df) # this will give an error, why?

In [None]:
a = student_grades_df.loc[:,'Total':'HWs']
print(type(a))

In [None]:
b = np.log(student_grades_df.loc[:,'Total':'HWs'])
b

In [None]:
student_grades_df

In [None]:
np.log(student_grades_df.drop(['Letter'],axis=1))

In [None]:
print(type(b))

In [None]:
print(type(student_grades_df['Total']))
c = np.log(student_grades_df['Total'])
c

In [None]:
print(type(c))
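
`np.log` fails on the full frame because the `Letter` column holds strings. Instead of naming the numeric columns by hand, `select_dtypes` can pick them. A sketch with two rows from the file (the names `grades`, `numeric` and `logged` are illustrative):

```python
import numpy as np
import pandas as pd

grades = pd.DataFrame({'Letter': ['D', 'B'],
                       'Total': [42.50, 75.17],
                       'Exams': [28.18, 41.45]})

numeric = grades.select_dtypes(include='number')  # keeps Total and Exams
logged = np.log(numeric)                          # still a DataFrame, same labels

print(type(logged))
print(logged)
```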

In class questions:
* apply with more than 1 dimension (use the same function on multiple columns or have a single function take input from multiple columns)?
* Non-unique indices
* Frequencies of the Letter column -> more generally descriptive statistics of categorical and/or ordinal data?
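
For the first question, a hedged sketch: `apply` on a `DataFrame` passes a whole column (`axis=0`, the default) or a whole row (`axis=1`) to the function. The helper `weighted` and its weights are hypothetical.

```python
import numpy as np
import pandas as pd

grades = pd.DataFrame({'Exams': [28.18, 41.45], 'HWs': [14.32, 33.72]})

# Same function on several columns: each column arrives as a Series
roots = grades[['Exams', 'HWs']].apply(np.sqrt)

# One function reading several columns: each row arrives as a Series
def weighted(row):
    # Hypothetical weighting, for illustration only
    return 0.6 * row['Exams'] + 0.4 * row['HWs']

scores = grades.apply(weighted, axis=1)
print(scores)
```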

In [None]:
with open('transport_2.csv') as f:
    print(f.read())

In [None]:
tran_nu = pd.read_csv('transport_2.csv', index_col = 0, skiprows=[1])
tran_nu

In [None]:
tran_nu.loc['Germany']

In [None]:
print(tran.index.is_unique)
print(tran_nu.index.is_unique)
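
With a non-unique index, `loc` returns every row that carries the label. A sketch with made-up numbers (the contents of `transport_2.csv` are not reproduced here, so the frame `dup` is an assumption):

```python
import pandas as pd

dup = pd.DataFrame({'car': [86.1, 85.2, 84.0]},
                   index=['France', 'Germany', 'Germany'])

print(dup.index.is_unique)       # whether every label occurs only once
print(dup.index.duplicated())    # marks repeated labels after their first occurrence
print(type(dup.loc['France']))   # label occurs once: a single row, returned as a Series
print(type(dup.loc['Germany']))  # repeated label: all matching rows, as a DataFrame
```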

In [None]:
student_grades_df['Letter'].value_counts()

In [None]:
student_grades_df.info()

In [None]:
student_grades_df.describe(include = 'object')

In [None]:
student_grades_df['Letter'].unique()

In [None]:
student_grades_df['Exams']+student_grades_df['HWs']

We can use conditional indexing with data-frames as well

In [None]:
tran

In [None]:
tran['car'] > 86

In [None]:
tran.index[tran['car'] > 86]

In [None]:
print('Countries that have more than 86% car usage:')
print(tran.index[tran['car'] > 86].values)

In [None]:
tran.loc[tran.index[tran['car'] > 86]]

In [None]:
tran.loc[tran['car'] > 86]

In [None]:
# If it were the other way
tmp = tran.T
tmp

In [None]:
tmp.loc['car'] > 86

In [None]:
tmp.columns[tmp.loc['car'] > 86]

In [None]:
tmp[tmp.columns[tmp.loc['car'] > 86]]

**Exercises:** 
* Display all grades of students with Total > 60
* Display only letter grades of students with Total > 60
* Display all grades of students with a B-, B or B+

In [None]:
student_grades_df.loc[student_grades_df['Total'] > 60]

In [None]:
student_grades_df.loc[student_grades_df['Total'] > 60].index

In [None]:
student_grades_df[student_grades_df['Total'] > 60]

In [None]:
student_grades_df['Letter'].loc[student_grades_df['Total'] > 60]

In [None]:
desired = (student_grades_df['Letter'] == 'B') | (student_grades_df['Letter'] == 'B-') | (student_grades_df['Letter'] == 'B+')

In [None]:
student_grades_df.loc[desired]
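
The chained comparisons can also be written with `isin`, which tests membership in a list of values. A sketch on a few made-up rows (the frame `grades` is illustrative):

```python
import pandas as pd

grades = pd.DataFrame({'Letter': ['B', 'F', 'B+', 'D', 'B-'],
                       'Total': [75.17, 33.21, 78.00, 42.50, 70.10]})

desired = grades['Letter'].isin(['B-', 'B', 'B+'])
print(grades.loc[desired])
```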

Adding columns to `DataFrame` objects and new items to `Series` objects works just like adding key-value pairs to a dictionary

In [None]:
tran['public_transport'] = tran['bus'] + tran['rail']
print('Country with the most public transport:', tran.public_transport.idxmax())

In [None]:
tran

In [None]:
a = tran['public_transport']
print(type(a))
a

In [None]:
a['Austria'] = 14.2
a

In [None]:
# We can similarly assign new values
a['Germany'] = 15
a

In [None]:
# Original is unchanged
tran

Some other examples:

In [None]:
tranTmp = tran.copy()
tranTmp['example1'] = np.log(tran['car']/tran['bus'])**2+2

tranTmp['example2'] = np.array([str(tran['car'].iloc[0])+'-1','b','c','d'])

tranTmp['example3'] = [(1,0),(0,1),(0,0),(1,0)]
tranTmp

In [None]:
tranTmp.info()

**Exercise:** Check whether the sum of the `Exams` and `HWs` columns really equals the `Total` column (for `student_grades_df`)

In [None]:
(student_grades_df['Exams'] + student_grades_df['HWs'] == student_grades_df['Total']).value_counts()

In [None]:
((student_grades_df['Exams'] + student_grades_df['HWs'] - student_grades_df['Total']).abs() < 1e-2).value_counts()
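
Exact equality is fragile for floating-point sums, which is why a tolerance is used. A sketch of the same check with `np.isclose` on made-up rows (the second row is the classic `0.1 + 0.2` case):

```python
import numpy as np
import pandas as pd

grades = pd.DataFrame({'Exams': [28.18, 0.1],
                       'HWs': [14.32, 0.2],
                       'Total': [42.50, 0.3]})

exact = grades['Exams'] + grades['HWs'] == grades['Total']
close = np.isclose(grades['Exams'] + grades['HWs'], grades['Total'], atol=1e-2)

print(exact.tolist())
print(close.tolist())
```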

**Plotting with DataFrames**  
We can use Matplotlib with Pandas to plot data in dataframe and series objects

In [None]:
# Single series
plt.plot(tran['bus'])
plt.show()

In [None]:
tran

In [None]:
# Entire frame
plt.plot(tran)
plt.show()

In [None]:
# Entire frame with a legend
plt.figure(figsize=(12,7))
plt.plot(tran,linewidth = 2.5)
plt.legend(tran.columns, fontsize = 16)
plt.tick_params(labelsize=16)
plt.show()

Pandas has its own plotting commands as well (they use matplotlib in the background by default, but another backend can be used if really needed)

In [None]:
tran.plot(figsize = (12,7), fontsize =16)  # all columns

In [None]:
tran['bus'].plot(kind='bar') #Single column

In [None]:
tran.plot(kind='bar')  # all columns

In [None]:
tran.T.plot(kind='bar')  # all columns of the transpose

In [None]:
tran.T.plot(kind='barh')  # all columns of the transpose, horizontally

In [None]:
tran.loc['Germany'].plot(kind='pie')

In [None]:
# The pie above double-counts, since public_transport is the sum of bus and rail; plot only car, bus and rail
tran.loc['Germany','car':'rail'].plot(kind='pie')

**Sorting DataFrames**  

DataFrames provide the `.sort_values` method for sorting. Sorting can be done in place with `inplace=True`, or a new sorted DataFrame can be returned with `inplace=False` (the default). The `by` argument of `sort_values` lets us choose the column to sort by. We can use the `ascending` argument to set the direction of sorting. We can sort the columns based on the values in a row by setting the `axis` argument to 1.

In [None]:
print('Car usage from most to least:')
display(tran.sort_values(by='car', ascending=False))

In [None]:
print('Bus usage from least to most:')
display(tran.sort_values(by='bus'))

In [None]:
tran.sort_values(by='bus', inplace=True) # Nothing is returned

In [None]:
tran

In [None]:
# Sorting the columns by the values in the 'Germany' row
display(tran.sort_values(by='Germany',axis=1))

**Renaming** columns is possible with the `.rename` method and a dictionary (keys correspond to the current names, values correspond to the desired names)

In [None]:
display(tran)
tran.rename(columns={'bus': 'coach', 
                     'rail': 'train'}, inplace=True)
display(tran)

We can also change the name of the index column (`'country'` -> `'somewhere in Europe'`)

In [None]:
tran.index.names = ['somewhere in Europe']
display(tran)

Back to file reading (sort of): 
Write the following in a code cell and press tab: `pd.read` and `tran.to` or alternatively visit [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html)


In [None]:
tran

In [None]:
tran.to_csv()

In [None]:
print(tran.to_csv())

In [None]:
tran.to_csv('transport_edited.csv')

with open('transport_edited.csv','r') as f:
    print(f.read())

In [None]:
tran.to_json()

In [None]:
tran.to_json('transport_edited.json')

In [None]:
tran

In [None]:
# The index name is lost. There are roundabout ways to keep it
tran_json = pd.read_json('transport_edited.json')
tran_json
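
One of the roundabout ways, sketched under the assumption that the index has a name: move the index into a regular column before writing and restore it after reading. The frame `demo` is a stand-in for `tran`, and the JSON text is kept in memory instead of a file.

```python
import io
import pandas as pd

demo = pd.DataFrame({'car': [86.1, 85.2], 'bus': [5.3, 7.1]},
                    index=pd.Index(['France', 'Germany'], name='country'))

text = demo.reset_index().to_json()  # 'country' is now an ordinary column
restored = pd.read_json(io.StringIO(text)).set_index('country')

print(restored)
```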

In [None]:
student_grades_df.to_json()

In [None]:
student_grades_df.to_json('sg.json') 
sg_json=pd.read_json('sg.json')
sg_json

In [None]:
tran_mi = tran.copy()
arrays = [[1, 1, 2, 2], ['red', 'blue', 'red', 'blue']]
mi = pd.MultiIndex.from_arrays(arrays, names=('number', 'color'))
tran_mi.index=mi

In [None]:
tran_mi

In [None]:
tran_mi.loc[1]

In [None]:
tran_mi.loc[(1,'red')]

In [None]:
tran_mi.loc[(2,'red')]

In [None]:
tmp_tran = pd.read_csv('transport.csv',index_col=[0,2], skiprows= [1])
tmp_tran

In [None]:
print(tmp_tran.loc['France'])
print()
print(tmp_tran.loc[('France',5.3)])

**Exercise:** Let's read auto-mpg.data and play around with it together. Let's look at several rows:

```
18.0   8   307.0      130.0      3504.      12.0   70  1	"chevrolet chevelle malibu"
15.0   8   350.0      165.0      3693.      11.5   70  1	"buick skylark 320"
18.0   8   318.0      150.0      3436.      11.0   70  1	"plymouth satellite"
16.0   8   304.0      150.0      3433.      12.0   70  1	"amc rebel sst"
17.0   8   302.0      140.0      3449.      10.5   70  1	"ford torino"
15.0   8   429.0      198.0      4341.      10.0   70  1	"ford galaxie 500"
14.0   8   454.0      220.0      4354.       9.0   70  1	"chevrolet impala"
...
11.0   8   350.0      180.0      3664.      11.0   73  1	"oldsmobile omega"
20.0   6   198.0      95.00      3102.      16.5   74  1	"plymouth duster"
21.0   6   200.0      ?          2875.      17.0   74  1	"ford maverick"
19.0   6   232.0      100.0      2901.      16.0   74  1	"amc hornet"
15.0   6   250.0      100.0      3336.      17.0   74  1	"chevrolet nova"
31.0   4   79.00      67.00      1950.      19.0   74  3	"datsun b210"
26.0   4   122.0      80.00      2451.      16.5   74  1	"ford pinto"
32.0   4   71.00      65.00      1836.      21.0   74  3	"toyota corolla 1200"
25.0   4   140.0      75.00      2542.      17.0   74  1	"chevrolet vega"
16.0   6   250.0      100.0      3781.      17.0   74  1	"chevrolet chevelle malibu classic"
...
```

Things to notice:
* No header or column names. We can specify this ourselves
* There is a "?" to denote missing data
* The columns are not separated by commas but by spaces
* The last part is separated by tabs!


In [None]:
# The column names are not in the data
column_names = ['MPG','Cylinders','Displacement','Horsepower','Weight',
                'Acceleration', 'Model Year', 'Origin'] 

dataset_df = pd.read_csv('auto-mpg.data', 
                          names=column_names, # giving them manually
                          na_values = "?",  # missing values
                          sep=" ", #spaces
                          skipinitialspace=True, #if you do not do this, the parser interprets inbetween spaces as items
                          comment='\t') # Treat the car names as comments to get rid of them, otherwise we need regex

dataset_df
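
The same arguments can be tried on a few lines of the file without reading it from disk. This is a sketch: the rows are copied from the excerpt above, and `demo_df` is an illustrative name.

```python
import io
import pandas as pd

raw = ('18.0   8   307.0      130.0      3504.      12.0   70  1\t"chevrolet chevelle malibu"\n'
       '21.0   6   200.0      ?          2875.      17.0   74  1\t"ford maverick"\n'
       '31.0   4   79.00      67.00      1950.      19.0   74  3\t"datsun b210"\n')

column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']

demo_df = pd.read_csv(io.StringIO(raw),
                      names=column_names,     # no header line in the data
                      na_values='?',          # '?' marks a missing value
                      sep=' ',
                      skipinitialspace=True,  # collapse the runs of spaces
                      comment='\t')           # drop everything after the tab (the car name)
print(demo_df)
```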

The origin is 1:USA, 2:Europe, 3:Japan. Let's add a new column for named origins

In [None]:
dataset_df['Origin']

In [None]:
labelToCountry = {1:'USA',2:'Europe', 3:'Japan'}
countryList = [labelToCountry[orig] for orig in dataset_df['Origin']]
dataset_df['Origin Names'] = countryList

# What else:
## replace
## apply

In [None]:
dataset_df
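
Besides `replace` and `apply`, `Series.map` accepts the dictionary directly. A sketch on a short stand-in for the `Origin` column (the names `origin`, `named` and `also_named` are illustrative):

```python
import pandas as pd

origin = pd.Series([1, 3, 2, 1], name='Origin')
labelToCountry = {1: 'USA', 2: 'Europe', 3: 'Japan'}

named = origin.map(labelToCountry)                      # dictionary lookup per value
also_named = origin.apply(lambda o: labelToCountry[o])  # same result through apply

print(named.tolist())
```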

In [None]:
#Create a mapping dictionary
#origins = ['USA', 'Europe', 'Japan']
#dataset_df['Named Origin'] = [origins[x-1] for x in dataset_df['Origin']]
#dataset_df

What can we look at? Some ideas:
* Descriptive Statistics
* Missing data
* Outliers
* Number of cylinders vs origin 
* what else?

In [None]:
dataset_df.info()

In [None]:
dataset_df.describe()

In [None]:
dataset_df.describe(include='object')

In [None]:
dataset_df['Origin Names'].value_counts()/len(dataset_df['Origin Names'])

In [None]:
# Pie chart by country
dataset_df['Origin Names'].value_counts().plot(kind='pie')

In [None]:
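# Assign the result back; inplace=True on a column selection is unreliable in recent pandas versions
dataset_df['Horsepower'] = dataset_df['Horsepower'].fillna(dataset_df['Horsepower'].mean())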

In [None]:
dataset_df.drop(['Weight'],axis=1).plot(kind='box', figsize=(12,7))

In [None]:
dataset_df['Weight'].plot(kind='box', figsize=(12,7))

In [None]:
# Create a plot to visually inspect this hypothesis:
# The distribution of the number of cylinders for American cars
# tends to have higher values than the same distribution for European and Japanese cars

# THIS CELL IS THE BAD VERSION

dataset_df[dataset_df['Origin']==1]['Cylinders'].plot(kind='hist')
dataset_df[dataset_df['Origin']==2]['Cylinders'].plot(kind='hist')
dataset_df[dataset_df['Origin']==3]['Cylinders'].plot(kind='hist')

In [None]:
carNumbers = [dataset_df[dataset_df['Origin']==1]['Cylinders'].value_counts(),
              dataset_df[dataset_df['Origin']==2]['Cylinders'].value_counts(),
              dataset_df[dataset_df['Origin']==3]['Cylinders'].value_counts()]

In [None]:
carNumbers

In [None]:
tmp = pd.concat(carNumbers,axis=1).fillna(0)
tmp.plot(kind='bar')

In [None]:
carNumbers[0].rename('USA',inplace= True)
carNumbers[1].rename('Europe',inplace= True)
carNumbers[2].rename('Japan',inplace= True)

In [None]:
finalCylinders = pd.concat(carNumbers,axis=1).fillna(0)
finalCylinders.plot(kind='bar')
plt.title('Origin vs Cylinder')
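
The counting and concatenation steps above can also be done in one call with `pd.crosstab`, which tabulates how often each pair of values occurs. A sketch on made-up rows (the frame `cars` is illustrative):

```python
import pandas as pd

cars = pd.DataFrame({'Cylinders': [8, 8, 4, 4, 6, 4],
                     'Origin Names': ['USA', 'USA', 'Japan', 'Europe', 'USA', 'Japan']})

counts = pd.crosstab(cars['Cylinders'], cars['Origin Names'])
print(counts)
# counts.plot(kind='bar') would draw the grouped bar chart
```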

A correlation example:  
* acceleration and cylinders
* acceleration and year

In [None]:
# Let's start by plotting
plt.scatter(dataset_df['Cylinders'],dataset_df['Acceleration'])

In [None]:
plt.figure(figsize=(16,7))
for i in range(1,4):
    f = plt.subplot(1,3,i)
    f.scatter(dataset_df[dataset_df['Origin']==i]['Cylinders'],dataset_df[dataset_df['Origin']==i]['Acceleration'])

In [None]:
np.corrcoef(dataset_df['Cylinders'],dataset_df['Acceleration'])

In [None]:
plt.scatter(dataset_df['Model Year'],dataset_df['Acceleration'])

In [None]:
plt.figure(figsize=(16,7))
for i in range(1,4):
    f = plt.subplot(1,3,i)
    f.scatter(dataset_df[dataset_df['Origin']==i]['Model Year'],dataset_df[dataset_df['Origin']==i]['Acceleration'])

In [None]:
np.corrcoef(dataset_df['Model Year'],dataset_df['Acceleration'])
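
Pandas offers the same Pearson correlation through `Series.corr` and `DataFrame.corr`. A sketch with made-up numbers (the frame `cars` is illustrative):

```python
import numpy as np
import pandas as pd

cars = pd.DataFrame({'Cylinders': [8, 8, 6, 4, 4],
                     'Acceleration': [12.0, 11.5, 16.0, 19.0, 16.5]})

r_numpy = np.corrcoef(cars['Cylinders'], cars['Acceleration'])[0, 1]
r_pandas = cars['Cylinders'].corr(cars['Acceleration'])

print(r_numpy, r_pandas)
print(cars.corr())  # full correlation matrix
```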

In [None]:
def correlationPlots(dataFrame, xname, yname, labelName = None, plotHandle = None):
    # Scatter plot of two columns (optionally grouped by labelName); returns their correlation matrix
    if plotHandle is None:
        plt.figure(figsize=(10,7))
        plotHandle = plt
    if labelName:
        items = dataFrame[labelName].unique()
        for i in items:
            inds = dataFrame[labelName]==i
            plotHandle.scatter(dataFrame[inds][xname],dataFrame[inds][yname])
        plt.legend(items)
    else:
        plotHandle.scatter(dataFrame[xname],dataFrame[yname])
    
    plt.xlabel(xname)
    plt.ylabel(yname)
    return np.corrcoef(dataFrame[xname],dataFrame[yname])

In [None]:
correlationPlots(dataset_df,'MPG','Weight')

In [None]:
correlationPlots(dataset_df,'MPG','Weight','Origin Names')

In [None]:
correlationPlots(dataset_df,'Displacement','Cylinders','Model Year')

In [None]:
dataset_df.columns

In [None]:
labelName = 'Origin'

if labelName:
    dataset= dataset_df.loc[:,dataset_df.dtypes != object].drop(labelName,axis=1)
else:
    dataset = dataset_df.loc[:,dataset_df.dtypes != object]

n = len(dataset.columns)
k = 1
plt.figure(figsize=(n*3,n*3))
for row in dataset.columns:
    for col in dataset.columns:
        if row == col:
            f = plt.subplot(n,n,k)
            f.boxplot(dataset_df[row])
            k += 1
        else:
            f = plt.subplot(n,n,k)
            k += 1
            correlationPlots(dataset_df,row,col,labelName,f)

In [None]:
dataset_df.loc[:,dataset_df.dtypes != object].columns

### Pandas Time Series Data

There is a Python module called [`datetime`](https://docs.python.org/3/library/datetime.html) that provides functionality to manipulate dates and times. Several important classes from this module:
* datetime: Date and time together, same name as the module (year, month, day, hour, minute, second, microsecond).
* date: Dates independent of time (year, month, day).
* time: Time independent of date (hour, minute, second, microsecond).
* timedelta: The duration between two date or datetime objects (with microsecond resolution).

Note: There are some details when dealing with time (time zones, daylight saving time, representable years, representable durations, resolution, etc.). Instead of going over everything, we will touch upon several points. The module documentation linked above has all these details.

In [None]:
import datetime

#MINYEAR <= year <= MAXYEAR,
#1 <= month <= 12,
#1 <= day <= number of days in the given month and year,
#0 <= hour < 24,
#0 <= minute < 60,
#0 <= second < 60,
#0 <= microsecond < 1000000

year = 2022
month = 1
day = 4
hour = 17 
minute = 30
second  = 15  
microsecond = 61283

In [None]:
print(datetime.datetime(year,month,day))
print(datetime.date(year,month,day))

In [None]:
print(datetime.datetime(year,month,day,hour))
print(datetime.time(hour))

print(datetime.datetime(year,month,day,hour,minute,second,microsecond))
print(datetime.time(hour,minute,second,microsecond))

In [None]:
fullDt1 = datetime.datetime(year,month,day,hour,minute,second,microsecond)

print(type(fullDt1))
print(fullDt1)
print(fullDt1.year,fullDt1.month,fullDt1.minute)

In [None]:
year2 = 2021
month2 = 11
day2 = 8
hour2 = 12 
minute2 = 21
second2  = 47  
microsecond2 = 8761

fullDt2 = datetime.datetime(year2,month2,day2,hour2,minute2,second2,microsecond2)

td = fullDt1 - fullDt2
print(type(td))
print(td.resolution, type(td.resolution))

In [None]:
print(td)
print(td.days,td.seconds,td.microseconds)

In [None]:
# print(td.minute) # timedelta has no minute attribute; uncommenting will give an error

In [None]:
print(td.total_seconds())

In [None]:
print(57*24*60*60+5*60*60+8*60+28+52522/1e6)

In [None]:
fullDt1.strftime('%Y-%m-%d')

In [None]:
tmp = datetime.datetime.strptime('2022-01-04','%Y-%m-%d')
print(tmp)

In [None]:
print(tmp.hour, tmp.minute)

In [None]:
# Current time
print(datetime.datetime.now())
#print(datetime.date.now())
#print(datetime.time.now())

In [None]:
import time
print(time.time(), type(time.time()))

In [None]:
print(datetime.datetime.fromtimestamp(time.time()))
print(datetime.date.fromtimestamp(time.time()))
#print(datetime.time.fromtimestamp(time.time()))

In [None]:
n = 10000
iters = 1000

#start = datetime.datetime.now()
start = time.time()
for i in range(iters):
    squaredIntLC = [x**2 for x in range(n)]
end = time.time()
#end = datetime.datetime.now()
#print((end-start).total_seconds())
print(end-start)

In [None]:
#start = datetime.datetime.now()
start = time.time()
for i in range(iters):
    squaredIntFL = []  # start from an empty list each time, as the comprehension does
    for x in range(n):
        squaredIntFL.append(x**2)
end = time.time()
#end = datetime.datetime.now()
#print((end-start).total_seconds())
print(end-start)

Look at Python profiling if more functionality is needed.
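
For short snippets, the standard-library `timeit` module is a common first step before full profiling. A sketch that times the two approaches (the function names are illustrative, and `number` is kept small so that it runs quickly):

```python
import timeit

n = 10000

def with_comprehension():
    return [x**2 for x in range(n)]

def with_loop():
    result = []
    for x in range(n):
        result.append(x**2)
    return result

# Each call below runs the function 100 times and returns the total time in seconds
t_lc = timeit.timeit(with_comprehension, number=100)
t_fl = timeit.timeit(with_loop, number=100)
print(t_lc, t_fl)
```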

In [None]:
np.datetime64("2022-01-01")

In [None]:
def getRandDate():
    year = np.random.randint(1970,2023)
    month = np.random.randint(1,13)
    day = np.random.randint(1,29) # to play it safe and easy
    hour = np.random.randint(0,24)
    minute = np.random.randint(0,60)
    second  = np.random.randint(0,60)  
    microsecond = np.random.randint(0,1000000)
    
    return datetime.datetime(year,month,day,hour,minute,second,microsecond)

dates = [getRandDate() for i in range(10)]
datesString = [date.strftime('%Y-%m-%d %H:%M:%S.%f') for date in dates]

In [None]:
dates

In [None]:
datesString

In [None]:
np.array(dates,dtype=np.datetime64)

In [None]:
np.array(dates,dtype='datetime64[m]')

In [None]:
np.array(datesString,dtype=np.datetime64)

In [None]:
datetime.datetime.strptime(datesString[2],'%Y-%m-%d %H:%M:%S.%f')

In [None]:
np.array(datesString,dtype=np.datetime64)-np.array(dates,dtype='datetime64[m]')

In [None]:
tmp = np.array(datesString,dtype='datetime64[us]')

In [None]:
tmp.dtype

In [None]:
print(tmp.dtype)

Pandas is very good at representing time series as well. The way to do it is to use the time stamps of the data as the index. The data type for this is pandas's own `datetime` (or `Timestamp`) data structure. Pandas can directly parse datetime information given as strings (multiple formats are supported). We need to specify the date columns when reading files; for example, we use the `parse_dates` argument of the `read_csv` function.

Let's see an example with the following file:  
`date, conc`  
`2014-04-01, 0.19`  
`2014-04-02, 0.23`  
`2014-04-03, 0.32`  
`2014-04-04, 0.29`  
`2014-04-05, 0.32`  

In [None]:
data = pd.read_csv('timeseries1.dat', parse_dates=[0], skipinitialspace=True)
display(data)

In [None]:
data['date'].values.dtype

In [None]:
print(data['date'].values.dtype)

In [None]:
tmp = pd.read_csv('timeseries1.dat', skipinitialspace=True)
tmp['date'].values.dtype

In [None]:
type(tmp.iloc[0,0])

In [None]:
print('datetime of row 0:', data.iloc[0, 0])
print('year  of row 0:', data.iloc[0, 0].year)
print('month of row 0:', data.iloc[0, 0].month)
print('day   of row 0:', data.iloc[0, 0].day)

In [None]:
type(data.iloc[0, 0])
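
The effect of `parse_dates` can be seen without the file. This is a sketch: the text reproduces a few of the rows listed above, and the names `parsed` and `unparsed` are illustrative.

```python
import io
import pandas as pd

raw = """date, conc
2014-04-01, 0.19
2014-04-02, 0.23
2014-04-03, 0.32
"""

parsed = pd.read_csv(io.StringIO(raw), parse_dates=[0], skipinitialspace=True)
unparsed = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

print(type(parsed.iloc[0, 0]))    # a pandas Timestamp
print(type(unparsed.iloc[0, 0]))  # a plain string
```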

The resolution of the previous data was a single day. We can represent hours, minutes and seconds as well, along with the date.

`date, conc`   
`2014-04-01 12:00:00, 0.19`   
`2014-04-01 13:00:00, 0.20`  
`2014-04-01 14:00:00, 0.23`  
`2014-04-01 15:00:00, 0.21` 

In [None]:
data = pd.read_csv('timeseries2.dat', parse_dates=[0], skipinitialspace=True)
data

In [None]:
data['date']

In [None]:
print('0th row hour:', data.iloc[0, 0].hour)
print('0th row minutes:', data.iloc[0, 0].minute)
print('0th row seconds:', data.iloc[0, 0].second)
print('0th row entire time:', data.iloc[0, 0].time())

### Indexing with Time
We can use dates and times to access and slice DataFrames whose index holds time stamps

In [None]:
data = pd.read_csv('timeseries1.dat', parse_dates=[0], index_col=0)
display(data)

In [None]:
print('April 1:', data.loc['2014-04-01'])
print('April 2:', data.loc['2014-04-02'])

In [None]:
print('Between April 2 and April 4:\n', data.loc['2014-04-02':'2014-04-04'])

In [None]:
print('Before April 3:\n', data.loc[data.index<'2014-04-03']) 

In [None]:
data.loc[(data.index<='2014-04-03') & (data.index>'2014-04-01')]
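
A self-contained sketch of the same operations, assuming a daily index built with `pd.date_range` instead of the file (the names `idx`, `conc`, `between` and `before` are illustrative):

```python
import pandas as pd

idx = pd.date_range('2014-04-01', periods=5, freq='D')
conc = pd.DataFrame({'conc': [0.19, 0.23, 0.32, 0.29, 0.32]}, index=idx)

between = conc.loc['2014-04-02':'2014-04-04']  # both end dates are included
before = conc.loc[conc.index < '2014-04-03']   # boolean mask on the index

print(between)
print(before)
```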

In [None]:
data = pd.read_csv('timeseries2.dat', parse_dates=[0], index_col=0)
display(data)

In [None]:
data.loc['2014-04-01 12:00:00']

In [None]:
a=data.index[1]-data.index[0]
print(a)
print(type(a))

In [None]:
a.total_seconds()

In [None]:
with open('timeseries1s.dat') as f:
    for line in f.readlines():
        print(line, end='')

In [None]:
data = pd.read_csv('timeseries1s.dat', parse_dates=[0], index_col=0)
data

In [None]:
data.loc[(data.index<='2014-04-03') & (data.index>'2014-04-01')]

In [None]:
print('Between April 2 and April 4:\n', data.loc['2014-04-02':'2014-04-04'])