# Data Transformation with Pandas

Pandas is a popular open-source (BSD-licensed) Python package that provides efficient, easy-to-use tools for data transformation and analysis. In this section, we show how to use Pandas to transform and analyze tabular data.  

------------------------------------------------------------------------------  

We import NumPy and Pandas as follows. The name after `as` is an alias for the package, so we don't need to type the full package name every time we use it.  

In [None]:
import numpy as np
import pandas as pd

With Pandas, we can read tables in different formats (CSV, JSON, Parquet, ...).  
  
In our case, we read a table stored as CSV. The table "Papers" is part of the Microsoft Academic Graph. It contains publication information such as title, publication year, publisher, ...  

In [None]:
Papers = pd.read_csv('~/datasets/s4/MAG/Papers.csv')
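
For readers without access to the MAG files, the same call can be tried on in-memory text. The sketch below is only an illustration: the rows are made up, and the column names are assumed to match the ones used later in this section.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk (not real MAG records).
csv_text = """PaperId,PaperTitle,Year,CitationCount,DocType
1,Paper A,2015,120,Journal
2,Paper B,2019,8,Conference
3,Paper C,2008,45,
"""

# read_csv accepts a file path or any file-like object.
toy_papers = pd.read_csv(io.StringIO(csv_text))

print(toy_papers.shape)
print(toy_papers)
```

The empty `DocType` field in the last row is read as NaN, which is how missing values appear in the rest of this section.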

By calling:   
```python
Papers.head()
```
we can view the first five rows of the dataset.

In [None]:
Papers.head()

By calling:   
```python
Papers.tail()
```
we can view the last five rows of the dataset.

In [None]:
Papers.tail()

We can also specify the number of rows we want to view:

In [None]:
Papers.head(10)

We can check all the column names:

In [None]:
Papers.columns

We can also check the data type of each column in the dataframe:

In [None]:
Papers.dtypes

We can get a basic statistical summary of the numeric columns:

In [None]:
Papers.describe()
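
By default, `describe()` summarizes only the numeric columns of a mixed dataframe. A minimal sketch on a made-up toy table (the values are invented for illustration) shows the difference between the default call and `include="all"`:

```python
import pandas as pd

# Made-up toy table; values are illustrative only.
toy_papers = pd.DataFrame({
    "PaperTitle": ["Paper A", "Paper B", "Paper C"],
    "Year": [2015, 2019, 2008],
    "CitationCount": [120, 8, 45],
})

# Default: count, mean, std, min, quartiles, max for numeric columns only.
numeric_summary = toy_papers.describe()

# include="all" also summarizes text columns (count, unique, top, freq).
full_summary = toy_papers.describe(include="all")

print(toy_papers.dtypes)
print(numeric_summary)
```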

To select part of a dataframe, for example a single column, we can use either square brackets or dot notation. Passing a list of column names in square brackets selects several columns at once:

In [None]:
Papers['PaperTitle']

In [None]:
Papers.PaperTitle

In [None]:
Papers[['PaperTitle', 'Year']]
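
The three forms above do not return the same kind of object. The sketch below uses a made-up table with a hypothetical column named `count` to illustrate two details: single brackets give a Series while a list gives a DataFrame, and dot notation stops working when a column name clashes with an existing DataFrame attribute.

```python
import pandas as pd

# Made-up table; the "count" column is hypothetical and chosen to clash
# with the DataFrame.count method.
toy = pd.DataFrame({
    "PaperTitle": ["Paper A", "Paper B"],
    "count": [3, 5],
})

title_series = toy["PaperTitle"]     # one column name -> Series
title_frame = toy[["PaperTitle"]]    # list of names -> DataFrame

count_column = toy["count"]          # the column, as expected
count_attribute = toy.count          # the method, not the column

print(type(title_series).__name__, type(title_frame).__name__)
```

Square brackets are therefore the safer choice when column names are not known in advance.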

With square brackets, we can also slice rows:

In [None]:
Papers[0:3]

By calling
```python
.sort_values()
```
we can sort a dataframe by the values of a column:

In [None]:
(Papers[['PaperTitle', 'CitationCount']]
    .sort_values(by='CitationCount', ascending=False)
    .head(20))
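
As a small, self-contained illustration of the same idea, the sketch below sorts a made-up table and keeps the two most-cited rows. It also shows `nlargest`, which is one possible shortcut for "sort descending, then take the top n"; the data is invented.

```python
import pandas as pd

# Made-up toy table; values are illustrative only.
toy_papers = pd.DataFrame({
    "PaperTitle": ["Paper A", "Paper B", "Paper C", "Paper D"],
    "CitationCount": [8, 120, 45, 3],
})

# Sort by citation count, highest first, then keep the top two rows.
top_sorted = toy_papers.sort_values(by="CitationCount", ascending=False).head(2)

# Same rows via nlargest.
top_nlargest = toy_papers.nlargest(2, "CitationCount")

print(top_sorted)
```

Note that sorting returns a new dataframe and leaves `toy_papers` in its original order.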

We can also use
```python
.loc[]
```
to select by label, using this format:
```python
df.loc[row labels, column names]
```
Unlike ordinary Python slices, label slices in `.loc[]` include both endpoints.

In [None]:
Papers.loc[1:4, 'PaperTitle']

To select multiple columns, pass a list of column names:

In [None]:
Papers.loc[1:4, ['PaperTitle', 'CitationCount']]
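
One detail worth checking on a small example is that a label slice in `.loc[]` includes its end label, while a plain bracket slice does not include its end position. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up toy table with the default integer index 0..3.
toy_papers = pd.DataFrame({
    "PaperTitle": ["Paper A", "Paper B", "Paper C", "Paper D"],
    "CitationCount": [120, 8, 45, 3],
})

by_label = toy_papers.loc[1:2, "PaperTitle"]   # labels 1 and 2, both included
by_slice = toy_papers[1:2]                     # position 1 only, end excluded

print(len(by_label), len(by_slice))
```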

Using
```python
.iloc[]
```
we can select by integer position. As with ordinary Python slices, the end position is excluded. 
We may do that using this format:
```python
df.iloc[row position, column position]
```

In [None]:
Papers.iloc[[1, 3, 5], [2, 4, 6]]

In [None]:
Papers.iloc[1:5, 2:4]
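
The difference between position and label only becomes visible once the index is no longer in its original order. The sketch below, on made-up data, sorts a table first and then compares `.iloc[]` with `.loc[]`:

```python
import pandas as pd

# Made-up toy table; values are illustrative only.
toy_papers = pd.DataFrame({
    "PaperTitle": ["Paper A", "Paper B", "Paper C"],
    "CitationCount": [8, 120, 45],
})

# After sorting, the index labels are 1, 2, 0 (they travel with the rows).
ranked = toy_papers.sort_values("CitationCount", ascending=False)

top_by_position = ranked.iloc[0, 0]               # first row as displayed
row_labelled_zero = ranked.loc[0, "PaperTitle"]   # row whose label is 0

# Position slices exclude the end: rows 0-1, column 0.
first_two = ranked.iloc[0:2, 0:1]

print(top_by_position, row_labelled_zero, first_two.shape)
```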

We can also select rows by condition:

In [None]:
Papers[Papers['CitationCount'] >= 100]
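
The expression inside the brackets is a boolean Series (a mask), and several masks can be combined with `&` (and), `|` (or) and `~` (not). Each comparison needs its own parentheses because these operators bind more tightly than `>=`. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up toy table; values are illustrative only.
toy_papers = pd.DataFrame({
    "PaperTitle": ["Paper A", "Paper B", "Paper C"],
    "Year": [2015, 2019, 2008],
    "CitationCount": [120, 8, 45],
})

# One True/False value per row.
well_cited = toy_papers["CitationCount"] >= 40

# Combine two conditions; note the parentheses around each comparison.
recent_and_cited = toy_papers[(toy_papers["CitationCount"] >= 40) & (toy_papers["Year"] >= 2010)]

print(well_cited.tolist())
print(recent_and_cited)
```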

Real-world data is often dirty, in the sense that it contains many NaN (missing) values. With Pandas, we can either remove rows that contain NaNs or fill the NaNs with another value:

In [None]:
Papers[['DocType', 'PaperTitle', 'CitationCount']].dropna()

Sometimes, removing rows with NaNs loses a lot of information. In the following example, we lose data because many publications in the dataset are missing a document type. Removing every row that contains a NaN might produce undesired results. 

In [None]:
subset = Papers[['DocType', 'PaperTitle', 'CitationCount']]
len(subset), len(subset.dropna())

Instead of removing NaNs, we can fill those fields with another value:

In [None]:
Papers['DocType'].fillna('unknown')
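
Two details are easy to miss here: `dropna()` can be restricted to specific columns with `subset`, and both `dropna()` and `fillna()` return new objects rather than changing the original. The sketch below illustrates both on made-up data:

```python
import numpy as np
import pandas as pd

# Made-up toy table with one missing DocType and one missing CitationCount.
toy_papers = pd.DataFrame({
    "DocType": ["Journal", np.nan, "Conference"],
    "CitationCount": [120, 8, np.nan],
})

dropped_any = toy_papers.dropna()                     # drop rows with any NaN
dropped_doc = toy_papers.dropna(subset=["DocType"])   # only look at DocType

filled = toy_papers["DocType"].fillna("unknown")      # returns a new Series

print(len(toy_papers), len(dropped_any), len(dropped_doc))
print(filled.tolist())
```

To keep the filled values, assign the result back, for example `toy_papers["DocType"] = filled`.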

With Pandas, we can compute simple descriptive statistics:

In [None]:
Papers.CitationCount.mean(), Papers.CitationCount.std()
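
When comparing results with NumPy, it helps to know that the Pandas `std()` computes the sample standard deviation (`ddof=1`) by default, while `np.std` defaults to the population version (`ddof=0`). A small sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Made-up citation counts; the mean is 5 and the squared deviations sum to 32.
citations = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

mean_value = citations.mean()
sample_std = citations.std()              # divides by n - 1
population_std = citations.std(ddof=0)    # divides by n
numpy_std = np.std(citations.to_numpy())  # NumPy default, also divides by n

print(mean_value, sample_std, population_std, numpy_std)
```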

We can count the frequency of each category in a column:

In [None]:
Papers.DocType.fillna('unknown').value_counts()
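
The `fillna('unknown')` step matters because `value_counts()` leaves out missing values by default. As a sketch on made-up data, passing `dropna=False` is another way to keep them in the count:

```python
import pandas as pd

# Made-up document types with two missing entries.
doc_types = pd.Series(
    ["Journal", "Conference", None, "Journal", None, "Journal"],
    name="DocType",
)

counts_default = doc_types.value_counts()                   # missing values excluded
counts_with_missing = doc_types.value_counts(dropna=False)  # missing values counted

print(counts_default)
print(counts_with_missing)
```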

Sometimes we want to apply an operation to every value in a column. We can simply call
```python
.apply(func)
```
with ```func``` being the operation, written as a Python function:

In [None]:
def published_recently(year):
    # True if the paper was published within 10 years of 2021
    return (2021 - year) <= 10

Papers.Year.apply(published_recently)
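
For simple arithmetic and comparisons like this one, the same result can usually be obtained without `apply`, by operating on the whole column at once. The sketch below compares the two on made-up years; the `reference_year` parameter is an assumption added here for illustration.

```python
import pandas as pd

def published_recently(year, reference_year=2021):
    # True if the paper was published within 10 years of reference_year.
    return (reference_year - year) <= 10

# Made-up publication years.
years = pd.Series([2008, 2011, 2015, 2021], name="Year")

via_apply = years.apply(published_recently)   # calls the function once per value
vectorized = (2021 - years) <= 10             # one operation on the whole column

print(via_apply.tolist())
print(vectorized.tolist())
```

`apply` remains useful when the operation cannot be expressed as column arithmetic.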

There are many ways to combine two dataframes. One way is to use ```pd.concat([])```:

In [None]:
pd.concat([Papers.Year, Papers.Year.apply(published_recently)], axis = 1)
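
With `axis = 1`, `pd.concat` places the inputs side by side and lines rows up by index label, not by position. The sketch below uses two made-up Series with deliberately different indexes to make that visible; the name `Recent` is hypothetical.

```python
import pandas as pd

# Made-up Series with partially overlapping index labels.
years = pd.Series([2008, 2015], index=[0, 1], name="Year")
flags = pd.Series([True, False], index=[1, 2], name="Recent")

# Columns are named after the Series; rows are the union of the index labels.
side_by_side = pd.concat([years, flags], axis=1)

# axis=0 (the default) stacks the inputs on top of each other instead.
stacked = pd.concat([years, years], ignore_index=True)

print(side_by_side)
print(len(stacked))
```

Labels present in only one input get NaN in the other column.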

If we want to combine two dataframes on matching column values, we can use ```merge()```.

Let's import another table:

In [None]:
PaperCitationContext = pd.read_csv('~/datasets/s4/MAG/PaperCitationContexts.csv')

We merge the two dataframes on PaperId to see what each paper cites and the citation context:

In [None]:
Papers.\
    merge(PaperCitationContext, how = 'inner', on = 'PaperId')[['PaperTitle', 'CitationContext']]
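
To see what `how = 'inner'` keeps and drops, it can help to run the merge on two tiny made-up tables. In this sketch, Paper B has no citation context, so it disappears from the inner merge but survives a left merge with NaN in the new column.

```python
import pandas as pd

# Made-up tables; only the column names follow the MAG tables used above.
papers = pd.DataFrame({
    "PaperId": [1, 2, 3],
    "PaperTitle": ["Paper A", "Paper B", "Paper C"],
})
contexts = pd.DataFrame({
    "PaperId": [1, 1, 3],
    "CitationContext": ["context 1", "context 2", "context 3"],
})

inner = papers.merge(contexts, how="inner", on="PaperId")   # matching ids only
left = papers.merge(contexts, how="left", on="PaperId")     # keep every paper

print(inner)
print(left)
```

Paper A appears twice in both results because it matches two rows in `contexts`.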

To compute simple descriptive statistics or apply more complex operations to data broken into groups, we can use
```python
.groupby()
```
In the following case, we calculate the average citation count for each document type:

In [None]:
Papers.groupby('DocType')['CitationCount'].mean()

In the following case, we count the number of documents of each document type:

In [None]:
Papers.groupby('DocType')['PaperId'].count()
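
Rows whose group key is NaN are left out of a `groupby` by default, so papers without a document type do not appear in the two results above. A minimal sketch on made-up data shows this, along with `agg` for computing several statistics in one pass:

```python
import pandas as pd

# Made-up toy table; the last paper has no document type.
toy_papers = pd.DataFrame({
    "PaperId": [1, 2, 3, 4],
    "DocType": ["Journal", "Journal", "Conference", None],
    "CitationCount": [120, 8, 45, 3],
})

# Several statistics per group at once; the NaN group is dropped.
summary = toy_papers.groupby("DocType")["CitationCount"].agg(["mean", "count"])

# Filling the key first keeps those papers as their own group.
filled = toy_papers.assign(DocType=toy_papers["DocType"].fillna("unknown"))
counts = filled.groupby("DocType")["PaperId"].count()

print(summary)
print(counts)
```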

# Visualization with Matplotlib

### Importing Matplotlib

Just as we use the ``np`` shorthand for NumPy and the ``pd`` shorthand for Pandas, we will use some standard shorthands for Matplotlib imports:

In [None]:
import matplotlib as mpl
import matplotlib.pyplot as plt

### Setting Styles

We will use the ``plt.style`` directive to choose appropriate aesthetic styles for our figures.
Here we will set the ``classic`` style, which ensures that the plots we create use the classic Matplotlib style:

In [None]:
plt.style.use('classic')

#### Plotting from an IPython notebook

- ``%matplotlib notebook`` will lead to *interactive* plots embedded within the notebook
- ``%matplotlib inline`` will lead to *static* images of your plot embedded in the notebook


In [None]:
%matplotlib inline

In [None]:
import numpy as np
x = np.linspace(0, 10, 100)

fig = plt.figure()
plt.plot(x, np.sin(x), '-')
plt.plot(x, np.cos(x), '--');

## Two Interfaces for the Price of One

Matplotlib offers two interfaces: a state-based ``pyplot`` interface, shown first, in which each call acts on the current figure and axes, and an object-oriented interface.


In [None]:
plt.figure()  # create a plot figure

# create the first of two panels and set current axis
plt.subplot(2, 1, 1) # (rows, columns, panel number)
plt.plot(x, np.sin(x))

# create the second panel and set current axis
plt.subplot(2, 1, 2)
plt.plot(x, np.cos(x));

#### Object-oriented interface

In the object-oriented interface, the plotting functions are *methods* of explicit ``Figure`` and ``Axes`` objects:

In [None]:
# First create a grid of plots
# ax will be an array of two Axes objects
fig, ax = plt.subplots(2)

# Call plot() method on the appropriate object
ax[0].plot(x, np.sin(x))
ax[1].plot(x, np.cos(x));
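
Because each ``Axes`` object is addressed explicitly, labels and titles can be set per panel without relying on a "current" axis. The following is a minimal sketch of that pattern; the label text and the ``sharex=True`` choice are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)

# Two stacked panels that share the x axis.
fig, axes = plt.subplots(2, 1, sharex=True)

# Each panel is configured through its own Axes object.
axes[0].plot(x, np.sin(x), "-")
axes[0].set_ylabel("sin(x)")

axes[1].plot(x, np.cos(x), "--")
axes[1].set_ylabel("cos(x)")
axes[1].set_xlabel("x")

fig.suptitle("Sine and cosine")
```

The ``pyplot`` functions ``plt.xlabel``, ``plt.ylabel`` and ``plt.title`` correspond to the ``Axes`` methods ``set_xlabel``, ``set_ylabel`` and ``set_title``.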

## Simple Line Plots

A line plot of a single function $y = f(x)$:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
# On Matplotlib 3.6 and later, this style is named 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')
import numpy as np
fig = plt.figure()
ax = plt.axes()

x = np.linspace(0, 10, 1000)
ax.plot(x, np.sin(x));

To create a single figure with multiple lines, we can call the ``plot`` function multiple times:

In [None]:
plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x));

## Adjusting the Plot: Line Colors and Styles

In [None]:
plt.plot(x, np.sin(x - 0), color='blue')        # specify color by name
plt.plot(x, np.sin(x - 1), color='g')           # short color code (rgbcmyk)
plt.plot(x, np.sin(x - 2), color='0.75')        # Grayscale between 0 and 1
plt.plot(x, np.sin(x - 3), color='#FFDD44')     # Hex code (RRGGBB from 00 to FF)
plt.plot(x, np.sin(x - 4), color=(1.0,0.2,0.3)) # RGB tuple, values 0 to 1
plt.plot(x, np.sin(x - 5), color='chartreuse'); # all HTML color names supported

In [None]:
plt.plot(x, x + 0, linestyle='solid')
plt.plot(x, x + 1, linestyle='dashed')
plt.plot(x, x + 2, linestyle='dashdot')
plt.plot(x, x + 3, linestyle='dotted');

# For short, you can use the following codes:
plt.plot(x, x + 4, linestyle='-')  # solid
plt.plot(x, x + 5, linestyle='--') # dashed
plt.plot(x, x + 6, linestyle='-.') # dashdot
plt.plot(x, x + 7, linestyle=':');  # dotted

In [None]:
plt.plot(x, x + 0, '-g')  # solid green
plt.plot(x, x + 1, '--c') # dashed cyan
plt.plot(x, x + 2, '-.k') # dashdot black
plt.plot(x, x + 3, ':r');  # dotted red

## Labeling Plots

In [None]:
plt.plot(x, np.sin(x))
plt.title("A Sine Curve")
plt.xlabel("x")
plt.ylabel("sin(x)");

In [None]:
plt.plot(x, np.sin(x), '-g', label='sin(x)')
plt.plot(x, np.cos(x), ':b', label='cos(x)')
plt.axis('equal')

plt.legend();

## Visualization with Seaborn

In [None]:
import matplotlib.pyplot as plt
plt.style.use('classic')
%matplotlib inline
import numpy as np
import pandas as pd

In [None]:
# Create some data
rng = np.random.RandomState(0)
x = np.linspace(0, 10, 500)
y = np.cumsum(rng.randn(500, 6), 0)

In [None]:
# Plot the data with Matplotlib defaults
plt.plot(x, y)
plt.legend('ABCDEF', ncol=2, loc='upper left');

In [None]:
import seaborn as sns
sns.set()
# same plotting code as above!
plt.plot(x, y)
plt.legend('ABCDEF', ncol=2, loc='upper left');

### Pair plots
A *pair plot* of the iris data:

In [None]:
iris = sns.load_dataset("iris")
iris.head()

In [None]:
sns.pairplot(iris, hue='species', height=2.5);

### Faceted histograms

In [None]:
tips = sns.load_dataset('tips')
tips.head()

In [None]:
tips['tip_pct'] = 100 * tips['tip'] / tips['total_bill']

grid = sns.FacetGrid(tips, row="sex", col="time", margin_titles=True)
grid.map(plt.hist, "tip_pct", bins=np.linspace(0, 40, 15));

### Factor plots

In [None]:
with sns.axes_style(style='ticks'):
    g = sns.catplot(x="day", y="total_bill", hue="sex", data=tips, kind="box")
    g.set_axis_labels("Day", "Total Bill");
