# Basics of data analysis using pandas

This tutorial is inspired by the notebook [Complete-guide-to-data-analysis-using-Python---IMDB-movies-data](https://github.com/lakshanagv/Complete-guide-to-data-analysis-using-Python---IMDB-movies-data/blob/main/Quick%20guide%20to%20Data%20Analysis%20using%20Pandas.ipynb) by `lakshanagv`. Many thanks for the comprehensive notebook!

In [None]:
import pandas as pd
import numpy as np

## Pandas basic data structures

Pandas provides two basic data structures: `Series` and `DataFrame`.

|                 | Pandas Series                                                | Pandas DataFrame                                                    |
| --------------- | ------------------------------------------------------------ | ------------------------------------------------------------------- |
| Format          | One-dimensional                                              | Two-dimensional                                                     |
| Data Types      | Homogeneous - Series elements must be of the same data type. | Heterogeneous - DataFrame elements can have different data types.   |
| Number of Items | Size-immutable - Once created, the size cannot be changed.   | Resizable - Items can be deleted or added to an existing DataFrame. |
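
The "Data Types" row can be checked directly. The following is a minimal sketch with made-up values (`s` and `df_demo` are illustrative names, not part of the tutorial data): a `Series` carries a single dtype, while a `DataFrame` keeps one dtype per column.

```python
import pandas as pd

# A Series has one dtype: mixing integers and a float upcasts everything to float64
s = pd.Series([1, 2, 3.5])
print(s.dtype)

# A DataFrame stores a separate dtype for each column
df_demo = pd.DataFrame(
    {"name": ["a", "b"], "score": [1.5, 2.0], "passed": [True, False]}
)
print(df_demo.dtypes)
```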

In [None]:
# series with random numbers using "numpy"
s_1 = pd.Series(np.random.randn(5))

# random numbers with index
s_2 = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])

# more realistic ... a time series (the index holds plain strings, not actual timestamps)
data = {"09:28:52.419": 1016, "09:28:52.430": 1017, "09:28:52.441": 1017}
s_3 = pd.Series(data)
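
The keys of `data` are strings, so the index of `s_3` is not yet usable for time arithmetic. As a hedged sketch, one possible way to turn such an index into timestamps is `pd.to_datetime` with an explicit format. The name `s_ts` is illustrative, and because the strings carry no date, the parsed timestamps fall on the default date 1900-01-01.

```python
import pandas as pd

data = {"09:28:52.419": 1016, "09:28:52.430": 1017, "09:28:52.441": 1017}
s_ts = pd.Series(data)

# Parse the string labels; without a date part the default date 1900-01-01 is used
s_ts.index = pd.to_datetime(s_ts.index, format="%H:%M:%S.%f")

# With a DatetimeIndex, differences between labels are Timedelta objects
step = s_ts.index[1] - s_ts.index[0]
print(step)
```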

In [None]:
df_1 = pd.DataFrame(np.random.randn(6, 4))

df_2 = pd.DataFrame(
    {
        "A": True,
        "B": pd.date_range("20230101", periods=4),
        "C": pd.Series(np.random.randn(4)),
        "D": np.random.randint(16, size=4),
        "E": pd.Categorical(["A", "A", "B", "C"]),
        "F": "foo",
    }
)

## Evaluating data set

### Importing data

<span style='background :yellow' > 1. Task: *Read the csv file containing the movie data set (`IMDB-Movie-Data.csv`) into a pandas DataFrame.* </span>
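
As a generic illustration of `pd.read_csv`, the sketch below reads from an in-memory buffer instead of a file. The CSV content and the name `toy` are made up for this example and do not reflect the movie data set; for a file on disk, the path is passed in place of the buffer.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """Title,Year,Rating
Movie A,2015,7.1
Movie B,2016,6.4
"""

# read_csv accepts a path or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))
print(toy)
```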

In [None]:
# Inspect the data (head, columns, shape, tail)


In [None]:
# describe, info


In [None]:
# unique values


In [None]:
# min, max values
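
The inspection steps named in the cells above can be tried on a small frame first. This is a sketch on hypothetical toy data (the frame `toy` and its values are invented), using only standard pandas calls.

```python
import pandas as pd

# Hypothetical toy data, not the movie data set
toy = pd.DataFrame(
    {
        "Title": ["A", "B", "C", "D"],
        "Year": [2015, 2016, 2016, 2014],
        "Rating": [7.1, 6.4, 8.0, 5.5],
    }
)

print(toy.head(2))        # first two rows
print(toy.tail(1))        # last row
print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # column labels

toy.info()                # dtypes and non-null counts, printed directly
summary = toy.describe()  # summary statistics of the numeric columns

years = toy["Year"].unique()     # distinct values in order of appearance
n_years = toy["Year"].nunique()  # number of distinct values
lowest, highest = toy["Rating"].min(), toy["Rating"].max()
```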


<span style='background :yellow' > 2. Task: *Investigate the `value_counts()` method of pandas and identify the `Director` with the largest count.* </span>
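
To get a feeling for `value_counts()`, here is a minimal sketch on an invented `Series` of colors (unrelated to the movie data). The result is itself a `Series`, indexed by the distinct values and sorted by count in descending order.

```python
import pandas as pd

# Hypothetical values for illustration
colors = pd.Series(["red", "blue", "red", "green", "red", "blue"])

counts = colors.value_counts()  # counts per distinct value, largest first
print(counts)

most_common = counts.idxmax()   # label belonging to the largest count
print(most_common, counts.max())
```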

### Selection of data

In [None]:
# Indexing columns as Series and Dataframe


In [None]:
# Subsetting (loc, iloc)


In [None]:
# conditional filtering
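
The three selection techniques listed in these cells can be sketched on a toy frame. The data and all variable names below are hypothetical; the row labels `x`, `y`, `z` are chosen so that label-based and position-based access are easy to tell apart.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Title": ["A", "B", "C"], "Year": [2015, 2016, 2016], "Rating": [7.1, 6.4, 8.0]},
    index=["x", "y", "z"],
)

as_series = toy["Rating"]    # single label -> Series
as_frame = toy[["Rating"]]   # list of labels -> DataFrame

by_label = toy.loc["y", "Title"]  # loc selects by row/column labels
by_position = toy.iloc[1, 0]      # iloc selects by integer position (same cell)

# loc slices include the end label, iloc slices exclude the end position
loc_slice = toy.loc["x":"y"]
iloc_slice = toy.iloc[0:2]

# Conditional filtering with a boolean mask; combine conditions with & or |
recent_good = toy[(toy["Year"] == 2016) & (toy["Rating"] > 7)]
print(recent_good)
```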


<span style='background :yellow' > 3. Task: *What is the title of the lowest rated movie? Which movie has the highest rating?* </span>

<span style='background :yellow' > 4. Task: *Generate a new data frame containing `Title`, `Rating` and `Year` for all movies belonging to the `Action,Comedy,Drama` genre.* </span>

<span style='background :yellow' > 5. Task: *Find all movies directed by `Ridley Scott`. List year, title and rating, sorted by year.* </span>

### Group operations

In [None]:
# Basic grouping (mean Rating) (count Movies)


In [None]:
df.groupby(["Director", "Year"])[["Title"]].count().unstack().fillna("")
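
The chained expression above does several things at once, so it may help to split it up. The sketch below uses a made-up frame `toy` with the same column names. Instead of `fillna("")`, which blanks the gaps for display and turns the columns into object dtype, it shows `fillna(0).astype(int)` as one possible alternative that keeps the counts numeric.

```python
import pandas as pd

# Hypothetical toy data with the column names used in the tutorial
toy = pd.DataFrame(
    {
        "Director": ["X", "X", "X", "Y"],
        "Year": [2015, 2016, 2016, 2016],
        "Title": ["A", "B", "C", "D"],
        "Rating": [7.0, 6.0, 8.0, 5.0],
    }
)

# Basic grouping: one aggregated value per group
mean_rating = toy.groupby("Director")["Rating"].mean()

# Grouping by two keys gives a result with a two-level index
counts = toy.groupby(["Director", "Year"])[["Title"]].count()

# unstack moves the inner level (Year) into the columns;
# combinations that do not occur become NaN
wide = counts.unstack()

# Alternative to fillna(""): keep integer counts
wide_int = wide.fillna(0).astype(int)
print(wide_int)
```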

### Individual functions 

In [None]:
df.Actors.str.split(',').explode().value_counts()
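
The one-liner above combines three steps. They are shown separately in the following sketch on invented names (`actors`, `lists`, `long_form` are illustrative). If the entries are separated by a comma plus a space, an additional `.str.strip()` after `explode()` removes the leading blanks.

```python
import pandas as pd

# Hypothetical comma-separated cast lists
actors = pd.Series(["Ann,Bob", "Bob,Cid", "Ann,Bob,Dee"])

lists = actors.str.split(",")      # each cell becomes a list of names
long_form = lists.explode()        # one row per name, original index repeated
counts = long_form.value_counts()  # appearances per name
print(counts)
```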

Find all movies featuring "Christian Bale" as an actor!

Another example: highlighting cells with a custom styling function.

In [None]:
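def color_if_2016(val):
    # Return a CSS rule: red text for the value 2016, black otherwise
    color = 'red' if val == 2016 else 'black'
    return f'color: {color}'

# Note: Styler.applymap was renamed to Styler.map in pandas 2.1
df[['Year', 'Title']].head(15).style.applymap(color_if_2016)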

<span style='background :yellow' > 6. Task: *Extend the function and include the 10 most active actors in the evaluation.* </span>

<span style='background :yellow' > 7. Task: *Apply the `rating_group` function (defined below) to the `Rating` column. Identify the director with the highest number of 'Good' classifications.* </span>

In [None]:
# Classify movies based on ratings
def rating_group(rating):
    if rating >= 7.5:
        return 'Good'
    elif rating >= 6.0:
        return 'Average'
    else:
        return 'Bad'
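
A function like this is typically applied element-wise with `Series.apply`. The sketch below does so on a few invented ratings and, as an alternative that is not part of the original notebook, reproduces the same classes with `pd.cut`. The bin edges are taken from the thresholds in `rating_group`, and `right=False` makes each interval closed on the left, which matches the `>=` comparisons.

```python
import numpy as np
import pandas as pd


def rating_group(rating):
    # Same thresholds as in the cell above
    if rating >= 7.5:
        return 'Good'
    elif rating >= 6.0:
        return 'Average'
    else:
        return 'Bad'


# Hypothetical ratings, including the boundary values 7.5 and 6.0
ratings = pd.Series([8.1, 7.5, 6.0, 5.9, 7.4])

# Element-wise application of a Python function
by_apply = ratings.apply(rating_group)

# Vectorized alternative: bins [-inf, 6.0), [6.0, 7.5), [7.5, inf)
by_cut = pd.cut(
    ratings,
    bins=[-np.inf, 6.0, 7.5, np.inf],
    labels=['Bad', 'Average', 'Good'],
    right=False,
)
print(pd.DataFrame({'rating': ratings, 'apply': by_apply, 'cut': by_cut}))
```

`apply` works with any Python function, whereas `pd.cut` returns a categorical column whose categories keep the order of the labels.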