# Data Manipulation and Wrangling - Pandas (Part I)
**Course:** Info 98: Essential Tools for Data Scientists, **By:** Marta Carrizo, Luke Liu

Import pandas as pd at the very beginning of the notebook!

In [2]:
import pandas as pd

## Reading Data Sources

Pandas has a number of very useful file-reading tools. Today we'll use `read_csv` to read a dataset of top trending YouTube videos between November 2017 and June 2018.

CSV: Comma Separated Values

__Helpful Hint:__ Type "pd.re" and press Tab to see the available functions!

In [7]:
videos = pd.read_csv("videos.csv") # Make sure the file is in the same folder as the notebook
videos # if we end a cell with an expression or variable name, the result will print
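If you don't have `videos.csv` at hand, here is a minimal sketch of what `read_csv` does, using a small in-memory CSV instead of a file. The text in `csv_text` is made up for illustration and is not the real dataset.

```python
import io
import pandas as pd

# Hypothetical stand-in for the contents of a CSV file (not the real videos.csv).
csv_text = """video_id,title,views
a1,First video,100
b2,Second video,250
"""

# read_csv accepts a path or any file-like object, so StringIO works here.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
```

The first line of the text becomes the column names, and each following line becomes one row.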


You can also read tables from a website:

In [None]:
dfs = pd.read_html("https://en.wikipedia.org/wiki/University_of_California,_Berkeley")
dfs[5].head(10)  # read_html returns a list of DataFrames; index 5 is the 6th table on the page

## Anatomy of a DataFrame

The `head` method returns the first few rows of a DataFrame.

In [None]:
videos.head() # Default: 5 rows

In [None]:
videos.head(3)

Similarly, tail returns the bottom few rows of the data frame.

In [None]:
videos.tail(7)

Shape returns the number of rows and columns.

In [None]:
videos.shape

Size returns the number of cells in the dataframe.

In [None]:
videos.size

Columns returns an `Index` object holding the column names.

In [None]:
videos.columns

Values returns the data as a two-dimensional NumPy array, with one inner array per row.

In [None]:
videos.values
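To see how these attributes relate to each other, here is a small sketch on a toy DataFrame. The data and column names are invented for illustration and are not taken from `videos.csv`.

```python
import pandas as pd

# Toy data; column names are illustrative only.
toy = pd.DataFrame({"title": ["a", "b", "c"], "likes": [10, 20, 30]})

print(toy.shape)          # tuple of (number of rows, number of columns)
print(toy.size)           # number of cells, i.e. rows * columns
print(list(toy.columns))  # the column Index converted to a plain Python list
print(toy.values)         # 2-D NumPy array of the cell values
```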

## Utility Operations

Find the top trending videos with the most likes:

In [None]:
videos.sort_values(["likes"], ascending = False).head()

Similarly, you can also find the top trending videos with the most dislikes:

In [None]:
videos.sort_values(["dislikes"], ascending = False).head()

Find the top trending videos with the most views:

In [None]:
videos.sort_values(["views"], ascending = False).head()

We can also find the videos that stayed on the trending list for the most days:

In [None]:
videos.sort_values(["num_trend_days"], ascending = False).head()
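Since `sort_values` takes a list of columns, it can also break ties with a second column. Below is a sketch on invented toy data (not the videos dataset) that sorts by likes in descending order and, within equal likes, by views in ascending order.

```python
import pandas as pd

# Toy data; values are made up for illustration.
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "likes": [5, 9, 9, 1],
    "views": [50, 70, 40, 10],
})

# One ascending flag per sort column: likes descending, views ascending.
ranked = toy.sort_values(["likes", "views"], ascending=[False, True])
print(ranked)

# The original DataFrame keeps its row order; sort_values returned a new one.
print(toy)
```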

We can rename columns:

In [None]:
videos.rename(columns={"video_id": "id"}).head()

**Note:** the `rename` method returned a new dataframe and didn't modify the original one.  

**Most methods in Pandas are non-mutating.** To keep a change, store the result in a new variable or reassign it to the original one.

In [None]:
videos.head()

In [None]:
videos = videos.rename(columns={"video_id": "id"})
# or: videos.rename(columns={"video_id": "id"}, inplace=True)  # returns None, so do not reassign
videos.head()
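The difference between the two styles can be sketched on a toy DataFrame (invented data, not the videos dataset): the default call returns a new DataFrame, while `inplace=True` modifies the original and returns `None`.

```python
import pandas as pd

toy = pd.DataFrame({"video_id": ["a1", "b2"], "likes": [1, 2]})

# Default behavior: a new DataFrame comes back, toy is untouched.
renamed = toy.rename(columns={"video_id": "id"})
print(list(toy.columns))
print(list(renamed.columns))

# With inplace=True the original is modified and the call returns None.
result = toy.rename(columns={"video_id": "id"}, inplace=True)
print(result)
print(list(toy.columns))
```

Because the in-place call returns `None`, assigning its result back to the variable would throw the DataFrame away.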

Astype allows you to convert columns into other types:

In [None]:
videos.astype({"latest_trend_year": float}).head()

Dtypes shows you the type of each column in the DataFrame:

In [None]:
videos.dtypes
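A quick way to check what `astype` did is to compare `dtypes` before and after. This is a sketch on toy data with an invented `year` column.

```python
import pandas as pd

toy = pd.DataFrame({"year": [2017, 2018], "title": ["a", "b"]})

# astype returns a converted copy; toy itself keeps its integer column.
converted = toy.astype({"year": float})

print(toy.dtypes)
print(converted.dtypes)
```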

By default a RangeIndex is attached enumerating the rows.

In [None]:
videos.index

You can take a sample of rows with sample().  When you access that sampled dataframe, it maintains the index of the rows in the original table.

In [None]:
a = videos.sample(5)
a

In [None]:
a.index
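Here is a small sketch of the same idea on toy data. The `random_state` argument is only there to make the draw repeatable, and `reset_index(drop=True)` is one way to get a fresh 0, 1, 2, ... index if you don't need the original labels.

```python
import pandas as pd

toy = pd.DataFrame({"title": list("abcdefgh")})

# Draw 3 rows; the index labels are those of the rows in the original table.
a = toy.sample(3, random_state=0)
print(a.index)

# Looking the sampled labels up in the original gives back the same rows.
print(toy.loc[a.index])

# Optionally discard the old labels and renumber from 0.
renumbered = a.reset_index(drop=True)
print(renumbered.index)
```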

You can change the index.

In [None]:
videos_indexed = videos.set_index("id")
videos_indexed

In [None]:
videos_indexed.index

**Note:** the `set_index` method is not mutating.   

In [None]:
videos.index

In [None]:
videos.head()

**Note:** The index does not need to be unique.
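A sketch of a non-unique index, using an invented `channel` column as the index: looking up a repeated label returns every row that carries it.

```python
import pandas as pd

toy = pd.DataFrame({
    "channel": ["x", "y", "x"],
    "title": ["a", "b", "c"],
})

# set_index returns a new DataFrame; toy keeps its RangeIndex.
by_channel = toy.set_index("channel")

print(by_channel.index.is_unique)  # the label "x" appears twice
print(by_channel.loc["x"])         # both rows labeled "x"
```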

The Columns are also an index:

In [None]:
videos.columns

## Indexing / Slicing

How do we access rows and columns of a DataFrame?

### Accessing Columns using `[]`

The first option to access columns is to use the operator [ ].

In [None]:
videos.head()

You can pass in a list of column names to get only those columns:

In [None]:
videos[["title", "likes", "dislikes"]].head()

If you pass a list with a single element, you get back a `DataFrame`.

In [None]:
videos[["title"]].head()

If you pass a single item instead of a list, you get back a `Series`.

In [None]:
titles = videos["title"]
titles.head()

This is a `pd.Series` object.

In [None]:
type(titles)

The `Series` object has an `index`, a `name`, and `values`.

In [None]:
titles.index

In [None]:
titles.name

In [None]:
titles.values

We can convert a Series into a DataFrame.

In [None]:
titles.to_frame().head()
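To recap the difference between the two bracket forms, here is a sketch on toy data: a single label gives a `Series`, a one-element list gives a `DataFrame`, and `to_frame` converts the first into the second.

```python
import pandas as pd

toy = pd.DataFrame({"title": ["a", "b"], "likes": [1, 2]})

as_series = toy["title"]    # single label -> Series
as_frame = toy[["title"]]   # list of labels -> DataFrame

print(type(as_series).__name__, as_series.name)
print(type(as_frame).__name__, list(as_frame.columns))

# The Series name becomes the column name after to_frame().
print(as_series.to_frame().equals(as_frame))
```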

Series has a method `.value_counts()` which gives the number of occurrences of each unique value.

In [None]:
month_counts = videos['latest_trend_month'].value_counts()
month_counts

In [None]:
month_counts.index

In [None]:
month_counts.values

In [None]:
month_counts[5]
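Note that the result of `value_counts` is itself a Series whose index holds the unique values, so looking something up in it is a lookup by label, not by position. The sketch below uses an invented list of month numbers and the explicit `.loc` and `.iloc` accessors to make the distinction visible.

```python
import pandas as pd

# Invented month numbers; 5 occurs three times, 11 twice, 12 once.
months = pd.Series([5, 5, 11, 12, 5, 11], name="month")
counts = months.value_counts()  # sorted by count, largest first

print(counts)
print(counts.loc[11])   # count for the label 11
print(counts.iloc[0])   # count in the first position (the most frequent value)
```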

### Accessing rows and columns by `.loc`

In [None]:
videos.head()

Another way to select only some rows and columns is to use `loc`

In [None]:
videos.loc[0:10, ["id", "title", "views"]]

We can use `:` to keep all rows or all columns, depending on which dimension we use it in.

In [None]:
videos.loc[[5,10,8], :]

If we only use one argument for `loc`, it will assume that it corresponds to the selections for rows. All columns will be selected by default.

In [None]:
videos.loc[[5, 10, 8]]

Slicing for loc is inclusive! Keep this in mind when comparing to iloc later.

In [None]:
videos.loc[0:6, 'title':'dislikes'] #Starts at 0 and ends at 6

To get a Series:

In [None]:
videos.loc[:, 'channel_title'].head()

To get a DataFrame:

In [None]:
videos.loc[:, ['channel_title']].head()

To get a single value:

In [None]:
videos.loc[1, 'channel_title']

### Accessing rows and columns by `iloc` (Integer Location)

`iloc` is very similar to loc, but is used to access based on numerical positions rather than index names.

`iloc` slicing excludes the end position, while `loc` slicing includes the end label.

In [None]:
videos.head()

In [None]:
videos.iloc[0:3, 0:3]
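The inclusive versus exclusive behavior is easy to check on a toy DataFrame (invented data) with the default 0, 1, 2, ... index, where labels and positions happen to coincide:

```python
import pandas as pd

toy = pd.DataFrame({"title": ["a", "b", "c", "d"], "likes": [1, 2, 3, 4]})

by_label = toy.loc[0:2]      # labels 0, 1 and 2: the end label is included
by_position = toy.iloc[0:2]  # positions 0 and 1: the end position is excluded

print(len(by_label), len(by_position))
```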

We can also use `:` to indicate that we want all rows or columns

In [None]:
videos.iloc[0:3, :]

The output type follows the same criteria as `loc`

In [None]:
videos.iloc[[0], [1]] #What is the output type?

In [None]:
videos.iloc[[0], ['title']] #What happens if I try this?

In [None]:
videos.iloc[0, [1]] #What is the output type?

In [None]:
videos.iloc[0, 1] #What is the output type?

### Adding and removing columns

Add and modify columns using the square brackets `[]`.

Unlike most pandas methods, assignment with `[]` actually mutates the DataFrame!

In [None]:
tmp = videos.copy() # create a copy of the dataframe
tmp["views"] = tmp["views"] * 10
tmp.head()

Adding a new column:

In [None]:
tmp["likes_dislikes_ratio"] = tmp["likes"]/tmp['dislikes']
tmp.head()

Removing a column:

In [None]:
tmp.drop("likes_dislikes_ratio", axis = 1).head(5)
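Note the asymmetry: assigning with `[]` changes the DataFrame, but `drop` returns a new one and leaves the original alone. A sketch on toy data follows; the `columns=` keyword used here is an alternative spelling of `axis=1`.

```python
import pandas as pd

toy = pd.DataFrame({"likes": [10, 30], "dislikes": [5, 10]})

# Assignment with [] adds the column to toy itself.
toy["ratio"] = toy["likes"] / toy["dislikes"]

# drop returns a new DataFrame; toy still has the ratio column.
without = toy.drop(columns="ratio")

print(list(toy.columns))
print(list(without.columns))
```

To actually remove the column, reassign the result, for example `toy = toy.drop(columns="ratio")`.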

### Filtering rows

Both `.loc[]` and `[ ]` accept an array of booleans as input. The array must have exactly one entry per row. The result is a filtered version of the DataFrame in which only the rows corresponding to True appear.

In [None]:
videos.head()

In [None]:
y2017 = videos['latest_trend_year'] == 2017
y2017.head(15)

A boolean Series can be used as an argument to the [ ] operator.

In [None]:
videos[y2017].head()

A more common and more compact way to accomplish this is to build the boolean Series inline:

In [None]:
videos[videos['latest_trend_year'] == 2017].head()

#### Logic Operators

`&` (and)

In [None]:
videos[
    (videos['latest_trend_year'] == 2017) & 
    (videos['likes'] > 2000000)
] # Each condition must be wrapped in parentheses!

`~` (not)

In [None]:
videos[
    (videos['latest_trend_year'] == 2017) & 
    ~(videos['likes'] <= 2000000)
]

`|` (or)

In [None]:
videos[
    (videos['latest_trend_year'] == 2017) | 
    (videos['likes'] > 300000)
].head()
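One way to keep long filters readable is to name each boolean Series first and then combine them. The sketch below does this on invented toy data; the names `recent` and `popular` are just illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [2017, 2017, 2018, 2018],
    "likes": [100, 900, 50, 800],
})

# Each comparison yields a boolean Series with one entry per row.
recent = toy["year"] == 2018
popular = toy["likes"] > 500

both = toy[recent & popular]        # rows where both conditions hold
either = toy[recent | popular]      # rows where at least one holds
neither = toy[~(recent | popular)]  # rows where none holds

print(both)
print(either)
print(neither)
```

Naming the masks also sidesteps the parentheses issue, since each comparison is evaluated on its own line before `&`, `|`, or `~` is applied.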

## NaN Values

These are values that are undefined in the dataset.

Observe that the likes and dislikes values are undefined when the ratings are disabled.

In [None]:
videos.head()

In [None]:
videos[videos["ratings_disabled"] == True]

In [None]:
videos.astype({'likes': int, 'dislikes': int}) #What would this return??

How do we deal with these values? An easy way out would be to drop the rows that contain them, but that might have statistical consequences.

In [None]:
dropped = videos.dropna()
dropped.head()

In [None]:
dropped[dropped["ratings_disabled"] == True]

This may result in significant data being lost. Another way to deal with it would be to fill those NaN values with other values.

In [None]:
filled = videos.fillna(0)
filled[filled["ratings_disabled"] == True]

Now, we are able to convert those values to ints!

In [None]:
filled.astype({'likes': int, 'dislikes': int})
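Before choosing between dropping and filling, it can help to count how many values are missing in each column. The sketch below shows that step, plus more targeted versions of `dropna` and `fillna` that only look at one column. The toy data is invented and only loosely mirrors the videos table.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "title": ["a", "b", "c"],
    "likes": [10.0, np.nan, 30.0],
    "ratings_disabled": [False, True, False],
})

# Count the missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop only the rows where "likes" is missing.
dropped = toy.dropna(subset=["likes"])

# Fill only the "likes" column, then convert it to int.
filled = toy.fillna({"likes": 0}).astype({"likes": int})
print(filled)
```

Keep in mind that a filled 0 is indistinguishable from a real 0, so whether this is appropriate depends on the analysis.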

## Acknowledgement
The dataset in this demo is taken from Kaggle, and some of the ideas in this demo are referenced from Data 100 Lectures.