<a href="https://colab.research.google.com/github/RodrigoSalles/Big-Data-and-Cloud-Computing---Colab/blob/master/BDCC_Example_Data_Processing_Using_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Example data processing using the pandas API
  

**[Big Data and Cloud Computing](https://www.dcc.fc.up.pt/~edrdo/aulas/bdcc), Eduardo R. B. Marques, DCC/FCUP**

To illustrate some simple data processing in Colab, we will make use of the popular [pandas](https://pandas.pydata.org/) library, in particular the [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) API.


  


In [0]:
# Import the pandas library under the alias 'pd'
import pandas as pd 

## References

- [Data Manipulation with Pandas](https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html), chapter 3 of the [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/index.html) (freely available online) by [Jake VanderPlas](http://vanderplas.com/).
- Official documentation
  - [Getting started](https://pandas.pydata.org/docs/getting_started/) guide:
    - [10 minutes to pandas](https://pandas.pydata.org/docs/getting_started/10min.html)
    - [Introduction to DataFrame](https://pandas.pydata.org/docs/getting_started/dsintro.html#dataframe)
  - [Complete API Reference](https://pandas.pydata.org/docs/reference/index.html), including the features we make use of here:
    - [DataFrame API](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)
    - [Input/Output](https://pandas.pydata.org/docs/reference/io.html)


## The MovieLens data set

We will use a small [MovieLens](https://grouplens.org/datasets/movielens/) data set from the [GroupLens project](https://grouplens.org/). 

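The following commands fetch the data set and unpack it into a folder that the Colab runtime can read.

With `use_google_drive` set to `True`, the data is saved to `/content/drive/My Drive/bdcc/lab2`, which requires your Google Drive to be mounted at `/content/drive`. Set `use_google_drive` to `False` to store the data in `/tmp/bdcc/lab2` instead; that folder is lost when the runtime is recycled.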


In [0]:
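use_google_drive = True
if use_google_drive:
  # Mount Google Drive so that the target folder exists in the runtime
  from google.colab import drive
  drive.mount('/content/drive')
  root = '/content/drive/My Drive/bdcc/lab2'
else:
  root = '/tmp/bdcc/lab2'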

!mkdir -p "$root"
!curl -L http://files.grouplens.org/datasets/movielens/ml-latest-small.zip -o "$root/ml-latest-small.zip"
!cd "$root" && unzip -o ml-latest-small.zip 
!ls "$root/ml-latest-small"

### A brief look at the data

In [0]:
dataset_path = root + '/ml-latest-small/'

#### Movies

In [0]:
# Load movie data frame from the CSV file
movies = pd.read_csv(dataset_path + 'movies.csv') 
movies

#### Tags

In [0]:
tags = pd.read_csv(dataset_path + 'tags.csv')
tags

#### Ratings

In [0]:
ratings = pd.read_csv(dataset_path + 'ratings.csv')
ratings

## Some data processing examples

### Simple filtering

Which movie has id 1234?

In [0]:
m1234 = movies[movies.movieId == 1234]
m1234
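
The boolean mask above always returns a data frame, which is empty when no row matches. As a hedged alternative, a label lookup on an index built from `movieId` returns the single row as a `Series` and raises `KeyError` if the id is absent. The sketch below uses a small hypothetical `toy_movies` frame rather than the real `movies.csv`.

```python
import pandas as pd

# Hypothetical stand-in for the movies data frame (not real MovieLens rows)
toy_movies = pd.DataFrame({
    'movieId': [1, 1234, 5000],
    'title': ['Movie A', 'Movie B', 'Movie C'],
})

# Boolean mask: the result is a DataFrame with zero or more rows
by_mask = toy_movies[toy_movies.movieId == 1234]

# Label lookup: index the frame by movieId, then select one row as a Series
indexed = toy_movies.set_index('movieId')
by_label = indexed.loc[1234]

print(by_mask)
print(by_label)
```

The mask is the safer choice when the id may be missing; the indexed lookup is convenient when many lookups by id follow.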


Which movies have "Star Wars" in the title?

In [0]:
star_wars_movies = movies[movies.title.str.contains('Star Wars')]
star_wars_movies
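
Note that `str.contains` is case-sensitive and interprets its pattern as a regular expression by default. A minimal sketch of a literal, case-insensitive match follows, assuming a small hypothetical `toy_movies` frame in place of the real data.

```python
import pandas as pd

# Hypothetical stand-in for the movies data frame (not real MovieLens rows)
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3, 4],
    'title': ['Star Wars: Episode IV (1977)',
              'STAR WARS Holiday Special (1978)',
              'Toy Story (1995)',
              None],
})

# case=False ignores letter case, regex=False matches the text literally,
# na=False treats missing titles as non-matching
mask = toy_movies.title.str.contains('star wars', case=False, regex=False, na=False)
matches = toy_movies[mask]
print(matches)
```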

### Global rating data


Let us now calculate a few summary statistics.

In [0]:
ratings[['rating']].agg(['min','max','mean','std'])
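
For comparison, `describe()` reports a similar summary in one call, adding the count and quartiles. The sketch below runs both on a hypothetical `toy_ratings` frame; note that pandas computes `std` as the sample standard deviation (`ddof=1`).

```python
import pandas as pd

# Hypothetical ratings, chosen so the statistics are easy to check by hand
toy_ratings = pd.DataFrame({'rating': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Same aggregation as in the cell above
stats = toy_ratings[['rating']].agg(['min', 'max', 'mean', 'std'])

# describe() adds count and the 25%, 50% and 75% quantiles
summary = toy_ratings.rating.describe()

print(stats)
print(summary)
```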

Let us plot the global rating distribution using a [Seaborn](https://seaborn.pydata.org/) plot.

In [0]:
import seaborn as sb

axes = sb.countplot(data=ratings, x='rating')

Let us derive the numbers shown in the plot ourselves.

In [0]:
counts = ratings.groupby('rating').rating.agg(['count'])
counts
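
The same per-value counts can also be obtained with `value_counts()`. This is a sketch on a hypothetical `toy_ratings` frame; `sort_index()` orders the result by rating value, since `value_counts()` sorts by frequency by default.

```python
import pandas as pd

# Hypothetical ratings (not the real MovieLens data)
toy_ratings = pd.DataFrame({'rating': [4.0, 3.5, 4.0, 5.0, 3.5, 4.0]})

# Group-by formulation, as in the cell above
by_group = toy_ratings.groupby('rating').rating.agg(['count'])

# Equivalent counts, re-ordered by rating value
by_value = toy_ratings.rating.value_counts().sort_index()

print(by_group)
print(by_value)
```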

### Order movies by average rating

First derive the average ratings.

In [0]:
avg_ratings_by_movie = ratings.groupby('movieId').rating.agg(['count','mean'])
avg_ratings_by_movie

The information is more human-readable if we join in the movie information and sort the data by average rating and movie title.

In [0]:
join_data = avg_ratings_by_movie.join(movies.set_index('movieId'))
sorted_data = join_data.sort_values(by=['mean','title'], ascending=[False,True])
sorted_data
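
`join` matches rows on the index and performs a left join by default, so movies without any rating do not appear in the result. The sketch below reproduces the pipeline on hypothetical toy frames and shows an equivalent formulation with `merge` on the `movieId` column.

```python
import pandas as pd

# Hypothetical stand-ins for the ratings and movies data frames
toy_ratings = pd.DataFrame({'movieId': [1, 1, 2, 3],
                            'rating': [4.0, 5.0, 3.0, 4.5]})
toy_movies = pd.DataFrame({'movieId': [1, 2, 3, 4],
                           'title': ['A', 'B', 'C', 'D']})

# Number of ratings and average rating per movie
avg = toy_ratings.groupby('movieId').rating.agg(['count', 'mean'])

# Index-based left join, as in the cell above; movie 4 has no ratings and is absent
joined = avg.join(toy_movies.set_index('movieId'))

# Column-based formulation of the same left join
merged = avg.reset_index().merge(toy_movies, on='movieId', how='left')

# Highest average first, ties broken alphabetically by title
ranked = joined.sort_values(by=['mean', 'title'], ascending=[False, True])
print(ranked)
```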

Suppose we now want to filter out movies with fewer than 5 ratings and show the top 20 of the remaining ones.

In [0]:
sorted_data.query('count >= 5').head(20)
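
The `query` expression can also be written as a boolean mask. Because `count` is also the name of a `DataFrame` method, the column has to be selected with brackets (`df['count']`), not attribute access. This is a sketch on a hypothetical `toy` frame with the same columns as `sorted_data`.

```python
import pandas as pd

# Hypothetical frame shaped like sorted_data (count, mean, title per movieId)
toy = pd.DataFrame(
    {'count': [10, 3, 7], 'mean': [4.5, 5.0, 4.0], 'title': ['A', 'B', 'C']},
    index=pd.Index([1, 2, 3], name='movieId'),
)
toy = toy.sort_values(by=['mean', 'title'], ascending=[False, True])

# query() formulation, as in the cell above (top 2 here instead of top 20)
via_query = toy.query('count >= 5').head(2)

# Equivalent boolean mask; toy.count would refer to the method, so use toy['count']
via_mask = toy[toy['count'] >= 5].head(2)

print(via_mask)
```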

## Exercises



1. Analyse the `tags` data frame. What are the 10 most used tags?
2. For the most used tag, identify the movies that have been tagged with it. Display the movie title information.
3. What are the 10 movies that have been tagged most often?
4. At the other extreme, are there movies with no tags?


