## Interpreter

Python is an interpreted language which can be used in two ways:

"Interactively": when you use it as an “advanced calculator”, executing one command at a time. To start Python in this mode, run `python` on the command line.

"Scripting" mode: executing a series of “commands” saved in a text file, usually with a `.py` extension after the name of your file, by passing the file name to `python` on the command line.

## Introduction to variables in Python

### Assigning values to variables
One of the most basic things we can do in Python is assign values to variables:

Here we’ve assigned data to the variables `text`, `number` and `pi_value`, using the assignment operator `=`. To review the value of a variable, we can type the name of the variable into the interpreter and press `Return`:

Everything in Python has a type. To get the type of something, we can pass it to the built-in function type:

In [None]:
# Type of the text object

In [None]:
# Type of the number object

In [None]:
# Type of the pi_value object

The variable `text` is of type `str`, short for “string”. Strings hold sequences of characters, which can be letters, numbers, punctuation or more exotic forms of text (even emoji!).

We can also see the value of something using another built-in function, `print`:

This may seem redundant, but in fact it’s the only way to display output in a script.

Tip: `print` and `type` are built-in functions in Python.
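
As a minimal sketch of the steps in this subsection, the cell below assigns three values, then inspects them with `type` and `print`. The specific values are assumptions chosen for illustration; only the variable names come from the text above.

```python
# Example values are assumptions chosen for illustration
text = "Data Carpentry"   # a string
number = 42               # an integer
pi_value = 3.1415         # a floating point number

# type() reports the type of each object
print(type(text))
print(type(number))
print(type(pi_value))

# print() displays the value itself, also when run as a script
print(text)
print(number)
print(pi_value)
```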

### Operators
We can perform mathematical calculations in Python using the basic operators `+`, `-`, `*`, `/`, `**` (power) and `%` (modulo):

In [None]:
# Addition

In [None]:
# Multiplication

In [None]:
# Power

In [None]:
# Modulo/remainder

We can also use comparison operators (`<`, `>`, `==`, `!=`, `<=`, `>=`) and logical operators (`and`, `or`, `not`). These expressions return a value of type `bool`, called a boolean.
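
The operators in this section can be tried out as follows. This is only an illustrative sketch; the numbers and variable names are arbitrary.

```python
# Arithmetic operators
total = 2 + 2        # addition
product = 6 * 7      # multiplication
power = 2 ** 16      # power
remainder = 13 % 5   # modulo/remainder
quotient = 7 / 2     # division always returns a float

print(total, product, power, remainder, quotient)

# Comparison operators return booleans
is_greater = 3 > 4
is_equal = 4 == 4

# Logical operators combine booleans
both = is_greater and is_equal
either = is_greater or is_equal
negated = not is_greater

print(is_greater, is_equal, both, either, negated)
print(type(is_greater))
```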

------------------------------------------------------
# SLIDES
------------------------------------------------------

# Getting help

------------------------------------------------------
# SLIDES
------------------------------------------------------

## Sequences: Lists and Tuples

### Lists
Lists are a common data structure to hold an ordered sequence of elements. Each element can be accessed by an index. Note that Python indexes start with 0 instead of 1:

In [None]:
# Indentation is very important in Python.

To add elements to the end of a list, we can use the append method.
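
A short sketch of indexing, looping over and appending to a list. The list contents are arbitrary example values.

```python
numbers = [1, 2, 3]

# Indexing starts at 0; negative indexes count from the end
first = numbers[0]
last = numbers[-1]
print(first, last)

# A for loop visits each element in order; the indented line is the loop body
for num in numbers:
    print(num)

# append adds a single element to the end of the list
numbers.append(4)
print(numbers)
```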

To find out what methods are available for an object, we can use the built-in `help` function, for example `help(list)`.

### Tuples
A tuple is similar to a list in that it’s an ordered sequence of elements. However, tuples cannot be changed once created (they are “immutable”). Tuples are created by placing comma-separated values inside parentheses `()`.
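
To illustrate the difference, the sketch below modifies a list and then attempts the same change on a tuple. The names `a_list` and `a_tuple` are illustrative only.

```python
a_list = [1, 2, 3]
a_tuple = (1, 2, 3)

# Lists are mutable, so item assignment works
a_list[0] = 10

# Tuples are immutable, so item assignment raises a TypeError
try:
    a_tuple[0] = 10
except TypeError as err:
    error_message = str(err)
    print("Tuples cannot be modified:", error_message)

print(a_list)
print(a_tuple)
```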

---------------------------------
# CHALLENGE 1
---------------------------------

## Dictionaries

A dictionary is a container that holds pairs of objects - keys and values.

Dictionaries work a lot like lists - except that you index them with keys. You can think about a key as a name or unique identifier for the value it corresponds to.

To add an item to the dictionary we assign a value to a new key:

Using for loops with dictionaries is a little more complicated. We can do this in two ways:
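
The dictionary operations described above might look like the following sketch. The dictionary name and its contents are made up for illustration.

```python
# Keys are on the left of each colon, values on the right
translation = {"one": "first", "two": "second"}

# Index with a key to get the matching value
print(translation["one"])

# Assigning to a new key adds an item
translation["three"] = "third"

# Way 1: loop over key-value pairs with items()
for key, value in translation.items():
    print(key, "->", value)

# Way 2: loop over the keys and look up each value
for key in translation.keys():
    print(key, "->", translation[key])
```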

---------------------------------
# CHALLENGE 2
---------------------------------

## Functions

Defining a section of code as a function in Python is done using the `def` keyword. For example, a function that takes two arguments and returns their sum can be defined as:
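
A minimal version of such a function is sketched here; the name `add_function` and the argument values are illustrative.

```python
def add_function(a, b):
    """Return the sum of the two arguments."""
    result = a + b
    return result


# Call the function and store the returned value
z = add_function(20, 22)
print(z)
```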

## Working With Pandas DataFrames in Python

### About Libraries
A library in Python contains a set of tools (called functions) that perform tasks on our data. Importing a library is like getting a piece of lab equipment out of a storage locker and setting it up on the bench for use in a project. Once a library is set up, it can be used or called to perform the task(s) it was built to do.

One of the best options for working with tabular data in Python is to use the Python Data Analysis Library (a.k.a. Pandas). 

### Reading CSV Data Using Pandas 

We can use Pandas’ read_csv function to pull the file directly into a DataFrame.

A DataFrame is a 2-dimensional data structure that can store data of different types (including strings, numbers, categories and more) in columns.

------------------------------------------------------
# SLIDES
------------------------------------------------------

We need to save the data to memory so we can work with it.
To do that, we need to assign the DataFrame to a variable.
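
As a self-contained sketch, the cell below reads a tiny CSV from an in-memory string rather than from a file on disk. The CSV contents and the name `movies_df` are assumptions made for illustration; with a real file you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk
csv_text = """original_title,year,budget,vote_average
Movie A,2005,1000000,6.5
Movie B,2010,2500000,7.1
Movie C,2005,,5.8
"""

# Assign the DataFrame to a variable so we can keep working with it
movies_df = pd.read_csv(io.StringIO(csv_text))

print(movies_df)
print(type(movies_df))
```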

In [None]:
# Save data to memory

In [None]:
# View the data object

In [None]:
# View the first few lines

In [None]:
# View object type

In [None]:
# View shape (dimensions)

In [None]:
# View data types

In [None]:
# View info

In [None]:
# Calculate summary statistics for all numeric columns

There are many ways to summarize and access the data stored in DataFrames, using attributes and methods provided by the DataFrame object.

Attributes are features of an object.

Methods are like functions, but they only work on particular kinds of objects. With a method, we can supply extra information in the parentheses to control behaviour.
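
The difference between attributes and methods can be seen in the sketch below, which uses a small made-up DataFrame (the column values are hypothetical).

```python
import pandas as pd

# Small hypothetical dataset for illustration
movies_df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "director": ["Director A", "Director A", "Director B", "Director B"],
    "budget": [100, 300, 200, 400],
})

# Attributes: no parentheses
print(movies_df.shape)
print(movies_df.columns)
print(movies_df.dtypes)

# Methods: called with parentheses, optionally with arguments
first_two = movies_df.head(2)   # the argument controls how many rows are returned
summary = movies_df.describe()  # summary statistics for numeric columns
print(first_two)
print(summary)
```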

---------------------------------
# CHALLENGE 3
---------------------------------

Let’s perform some quick summary statistics to learn more about the data that we’re working with.

In [None]:
# View columns

In [None]:
# View unique directors

---------------------------------
# CHALLENGE 4
---------------------------------

------------------------------------------------------
# SLIDES
------------------------------------------------------

## Groups in Pandas

We often want to calculate summary statistics grouped by subsets or attributes within fields of our data.

We can calculate basic statistics for all records in a single column using the syntax below:

We can also extract one specific metric if we wish:

But if we want to summarize by one or more variables, we can use Pandas’ `.groupby` method.
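
A sketch of these three steps on a small made-up DataFrame follows. The data values are hypothetical; `describe`, `mean` and `groupby` are standard pandas methods.

```python
import pandas as pd

# Small hypothetical dataset for illustration
movies_df = pd.DataFrame({
    "director": ["Director A", "Director A", "Director B", "Director B"],
    "budget": [100, 300, 200, 400],
    "vote_average": [6.0, 8.0, 5.0, 7.0],
})

# Basic statistics for all records in a single column
print(movies_df["budget"].describe())

# One specific metric
overall_mean = movies_df["budget"].mean()
print(overall_mean)

# Summarize by a variable: group the rows by director first
grouped = movies_df.groupby("director")
mean_by_director = grouped[["budget", "vote_average"]].mean()
print(mean_by_director)
```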

In [None]:
# Group data by director

In [None]:
# Summary statistics for all numeric columns by director

In [None]:
# Provide the mean for each numeric column by director

---------------------------------
# CHALLENGE 5
---------------------------------

Let’s next count the number of movies for each year. We’ll use `groupby` combined with a `count()` method.

We can also count just the rows that have the genre "Thriller":
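
One way to write both counts is sketched below. The column name `genre` and the data values are assumptions for illustration; adapt them to the columns in your own dataset.

```python
import pandas as pd

# Small hypothetical dataset for illustration
movies_df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "year": [2005, 2005, 2010, 2010],
    "genre": ["Thriller", "Comedy", "Thriller", "Thriller"],
})

# Number of movies for each year
movies_per_year = movies_df.groupby("year")["original_title"].count()
print(movies_per_year)

# Count per genre, then pick out only the "Thriller" entry
thriller_count = movies_df.groupby("genre")["original_title"].count()["Thriller"]
print(thriller_count)
```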

## Quick & Easy Plotting Data Using Pandas

We can plot our summary stats using Pandas, too.

---------------------------------
# CHALLENGE 6
---------------------------------

------------------------------------------------------
# SLIDES
------------------------------------------------------

## Indexing and Slicing in Python

We often want to work with subsets of a DataFrame object. There are different ways to accomplish this including: using labels (column headings), numeric ranges, or specific x,y index locations.

### Selecting data using Labels (Column Headings)

We use square brackets [] to select a subset of a Python object.

We can also create a new object that contains only the data within the `original_title` column as follows:

We can pass a list of column names too, as an index to select columns in that order. This is useful when we need to reorganize our data.
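
For example, on a small made-up DataFrame (the values are hypothetical):

```python
import pandas as pd

# Small hypothetical dataset for illustration
movies_df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "year": [2005, 2010, 2005],
    "budget": [100, 300, 200],
})

# A single column name returns a Series
titles = movies_df["original_title"]

# A list of column names returns a DataFrame with the columns in that order
subset = movies_df[["year", "original_title"]]

print(titles)
print(subset)
```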

### Extracting Range based Subsets: Slicing

Python uses 0-based indexing. This means that the first element in an object is located at position 0. 

In [None]:
# Create a list of numbers

In [None]:
# Indexing: getting a specific element

In [None]:
# Slicing: selecting a set of elements

---------------------------------
# CHALLENGE 7
---------------------------------

## Slicing Subsets of Rows and Columns

Slicing using the `[]` operator selects a set of rows and/or columns from a DataFrame. To slice out a set of rows, you use the following syntax: `data[start:stop]`.

When slicing in pandas the start bound is included in the output. The stop bound is one step BEYOND the row you want to select. So if you want to select rows 0, 1 and 2 your code would look like this:
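
A sketch with a small made-up DataFrame (values are hypothetical):

```python
import pandas as pd

# Small hypothetical dataset with the default integer index 0..4
movies_df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"],
    "year": [2003, 2005, 2005, 2010, 2012],
})

# Rows 0, 1 and 2: the stop bound 3 is not included
first_three = movies_df[0:3]
print(first_three)
```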

## Copying Objects vs Referencing Objects

## Selecting Rows and Columns using `loc` and `iloc`

We can select specific ranges of our data in both the row and column directions using either label or integer-based indexing.

`iloc` is primarily integer-based indexing, counting from 0. That is, you specify rows and columns by position number. Thus, the first row is row 0, the second column is column 1, etc.

`loc` is primarily a label-based indexing where you can refer to rows and columns by their name. E.g., column `year`. Note that integers may be used, but they are interpreted as a label.

In [None]:
# iloc[row slicing, column slicing]

In [None]:
# loc[row slicing, column slicing]

When using `loc`, integers can be used, but they refer to the index label and not the position. In addition, a `loc` slice includes its stop label, while an `iloc` slice excludes its stop position. For example, using `loc` to select `1:4` will give a different result than using `iloc` to select rows `1:4`.
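
The difference shows up even with the default integer index, as in this sketch on a made-up DataFrame (values are hypothetical):

```python
import pandas as pd

# Small hypothetical dataset with the default integer index 0..5
movies_df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C",
                       "Movie D", "Movie E", "Movie F"],
    "year": [2003, 2005, 2005, 2010, 2012, 2015],
})

# iloc slices by position and excludes the stop position: rows 1, 2, 3
by_position = movies_df.iloc[1:4]

# loc slices by label and includes the stop label: rows labelled 1, 2, 3, 4
by_label = movies_df.loc[1:4]

print(by_position)
print(by_label)
```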

In [None]:
# Select all columns for rows of index values 0 and 10

We can also select a specific data value using a row and column location within the DataFrame and iloc indexing:

Syntax for `iloc` indexing to find a specific data element:

`data.iloc[row, column]`
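
For instance, with a small made-up DataFrame (values are hypothetical):

```python
import pandas as pd

# Small hypothetical dataset for illustration
movies_df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "year": [2003, 2005, 2010],
})

# Third row (position 2), second column (position 1)
value = movies_df.iloc[2, 1]
print(value)
```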

## Subsetting Data using Criteria

We can also select a subset of our data using criteria. Let's select all the movies that were released in 2005.

Or we can select all rows that do not contain the year 2005:

We can define sets of criteria too:
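
These three selections might be written as in the sketch below. The data values are hypothetical; note that each condition is wrapped in parentheses when conditions are combined with `&` (and) or `|` (or).

```python
import pandas as pd

# Small hypothetical dataset for illustration
movies_df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"],
    "year": [2003, 2005, 2005, 2010, 2012],
})

# Movies released in 2005
released_2005 = movies_df[movies_df["year"] == 2005]

# Movies not released in 2005
not_2005 = movies_df[movies_df["year"] != 2005]

# A set of criteria: released from 2005 through 2010
in_range = movies_df[(movies_df["year"] >= 2005) & (movies_df["year"] <= 2010)]

print(released_2005)
print(not_2005)
print(in_range)
```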

---------------------------------
# CHALLENGE 8
---------------------------------

## Using masks to identify a specific condition

A mask can be useful to locate where a particular subset of values exists or doesn’t exist, for example NaN, or “Not a Number”, values. To understand masks, we also need to understand boolean objects in Python.

Boolean values include `True` or `False`. For example:

Let’s try this out. Let’s identify all locations in the data that have null (missing or NaN) data values. We can use the `isnull` method to do this. The `isnull` method will compare each cell with a null value. If an element has a null value, it will be assigned a value of `True` in the output object.

To select the rows where there are null values, we can use the mask as an index to subset our data as follows:

We can run `isnull` on a particular column too. What does the code below do?

Let's extract the homepages for the movies with missing titles:
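
The masking steps in this section can be sketched as follows. The data and URLs are made up for illustration; `isnull` and `any` are standard pandas methods.

```python
import pandas as pd

# Small hypothetical dataset with some missing values
movies_df = pd.DataFrame({
    "original_title": ["Movie A", None, "Movie C"],
    "homepage": ["http://example.com/a", "http://example.com/b", None],
})

# A boolean mask: True wherever a cell is null
mask = movies_df.isnull()
print(mask)

# Rows that contain at least one null value
rows_with_null = movies_df[movies_df.isnull().any(axis=1)]
print(rows_with_null)

# isnull on one column gives a boolean Series, usable as a row selector
missing_title = movies_df["original_title"].isnull()
homepages = movies_df.loc[missing_title, "homepage"]
print(homepages)
```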

## Checking the format of our data 

The format of individual columns and rows will impact analysis performed on a dataset read into a pandas DataFrame. For example, you can’t perform mathematical calculations on a string (text formatted data).

- Every value has a type.
- Use the built-in function type to find the type of a value.
- Types control what operations can be done on values.
- Strings can be added and multiplied.
- Strings have a length (but numbers don’t).
- Must convert numbers to strings or vice versa when operating on them.
- Can mix integers and floats freely in operations.
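
The points above can be checked with a few lines of plain Python. The values are arbitrary examples.

```python
# Strings can be added (concatenated) and multiplied (repeated)
full_name = "Ada" + " " + "Lovelace"
separator = "=" * 10

# Strings have a length
name_length = len(full_name)

# Mixing a number and a string directly raises a TypeError
try:
    1 + "2"
    mixing_failed = False
except TypeError:
    mixing_failed = True

# Convert explicitly in one direction or the other
as_number = 1 + int("2")
as_text = str(1) + "2"

# Integers and floats can be mixed freely; the result is a float
mixed = 1 + 2.5

print(full_name, separator, name_length, as_number, as_text, mixed)
```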

### Types of Data

## Working With Integers and Floats

If we divide one integer by another, we get a float.

We can also convert a floating point number to an integer or an integer to a floating point number. Notice that Python truncates toward zero (drops the fractional part) when it converts from floating point to integer, so positive values are rounded down.
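
A brief sketch of these conversions, using arbitrary values for `a` and `b`:

```python
a = 7.83
b = 7

# Dividing one integer by another gives a float
ratio = 5 / 9
print(type(ratio))

# int() drops the fractional part (truncation toward zero)
a_as_int = int(a)
negative_as_int = int(-7.83)

# float() converts an integer to a floating point number
b_as_float = float(b)

print(a_as_int, negative_as_int, b_as_float)
```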

In [None]:
# Convert a to an integer

In [None]:
# Convert b to a float

## Working With Our Movies Data

In [None]:
# Convert the id field from an integer to a float

---------------------------------
# CHALLENGE 9
---------------------------------

## Missing Data Values - NaN

NaN (Not a Number) values are undefined values that cannot be represented mathematically. pandas, for example, will read an empty cell in a CSV or Excel sheet as NaN.

NaNs have some desirable properties: if we were to average the `budget` column without replacing our NaNs, pandas would skip over those cells by default.

Dealing with missing data values is always a challenge.

It’s sometimes hard to know why values are missing:
- Was it because of a data entry error?
- Or data that someone was unable to collect?
- Should the value be 0? 

We need to know how missing values are represented in the dataset in order to make good decisions. If we’re lucky, we have some metadata that will tell us more about how null values were handled.

We can figure out how many rows contain NaN values for `vote_average`. We can also create a new subset from our data that only contains rows with `vote_average > 0` (i.e., select meaningful values):

We can replace all `NaN` values with zeroes using the `.fillna()` method (after making a copy of the data so we don’t lose our work):

However NaN and 0 yield different analysis results. The mean value when NaN values are replaced with 0 is different from when NaN values are simply thrown out or ignored.

We can fill NaN values with any value that we choose. The code below fills all NaN values with the mean of all `vote_average` values.
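
The effect of the different fill strategies can be seen in this sketch, which uses a tiny made-up `vote_average` column (the values are hypothetical).

```python
import numpy as np
import pandas as pd

# Small hypothetical dataset with two missing ratings
movies_df = pd.DataFrame({"vote_average": [6.0, np.nan, 8.0, np.nan]})

# How many rows are missing a vote_average?
n_missing = movies_df["vote_average"].isnull().sum()

# mean() skips NaN values by default
mean_skipping_nan = movies_df["vote_average"].mean()

# Work on copies so the original data is left untouched
df_zero = movies_df.copy()
df_zero["vote_average"] = df_zero["vote_average"].fillna(0)
mean_with_zero = df_zero["vote_average"].mean()

df_mean = movies_df.copy()
df_mean["vote_average"] = df_mean["vote_average"].fillna(mean_skipping_nan)
mean_with_mean = df_mean["vote_average"].mean()

print(n_missing, mean_skipping_nan, mean_with_zero, mean_with_mean)
```

Filling with 0 pulls the mean down, while filling with the mean leaves it unchanged.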

---------------------------------
# CHALLENGE 10
---------------------------------

## Writing Out Data to CSV

First, let’s reload the data so we’re not mixing up all of our previous manipulations.

Let’s drop all the rows that contain missing values. We will use the command `dropna`. By default, `dropna` removes rows that contain missing data for even just one column.

Export a DataFrame in CSV format and save it in the `data_output` directory.
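
A sketch of dropping incomplete rows and exporting the result follows. To keep the example self-contained it writes into a temporary directory; the data values and the file name `movies_complete.csv` are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small hypothetical dataset with one incomplete row
movies_df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "budget": [100.0, np.nan, 300.0],
})

# Drop every row that has a missing value in any column
df_na = movies_df.dropna()

# Create a data_output directory (inside a temporary folder for this example)
output_dir = os.path.join(tempfile.mkdtemp(), "data_output")
os.makedirs(output_dir, exist_ok=True)

# index=False leaves the row index out of the file
output_path = os.path.join(output_dir, "movies_complete.csv")
df_na.to_csv(output_path, index=False)
print(output_path)
```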

------------------------------------------------------
# SLIDES
------------------------------------------------------

# Concatenating DataFrames

We often need to combine data files into a single DataFrame to analyze the data.

We can use the `concat` function in pandas to append either columns or rows from one DataFrame to another.

When we concatenate DataFrames, we need to specify the axis:
- `axis=0` will stack the second DataFrame UNDER the first one. Columns need to have the same name and data types.
- `axis=1` will stack the columns in the second DataFrame to the RIGHT of the first DataFrame. Rows need to be related.
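
Both directions are sketched below with two small made-up DataFrames (the names and values are hypothetical).

```python
import pandas as pd

# Two small hypothetical DataFrames with the same columns
first = pd.DataFrame({"movieId": [1, 2], "title": ["Movie A", "Movie B"]})
second = pd.DataFrame({"movieId": [3, 4], "title": ["Movie C", "Movie D"]})

# axis=0: stack the second DataFrame under the first
vertical_stack = pd.concat([first, second], axis=0)

# The original row labels (0, 1, 0, 1) are kept, so reset them
vertical_stack = vertical_stack.reset_index(drop=True)

# axis=1: place the DataFrames side by side, aligned on the row index
horizontal_stack = pd.concat([first, second], axis=1)

print(vertical_stack)
print(horizontal_stack)
```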

In [None]:
# Stack the DataFrames on top of each other

In [None]:
# Reindex the new DataFrame using the reset_index() method

In [None]:
# Write DataFrame to CSV

In [None]:
# Place the DataFrames side by side

## Joining DataFrames

When we concatenated our DataFrames, we simply added them to each other - stacking them either vertically or side by side. Another way to combine DataFrames is to use columns in each dataset that contain common values (a common unique identifier).

The columns containing the common values are called “join key(s)”. Joining DataFrames in this way is often useful when one DataFrame is a “lookup table” containing additional data that we want to include in the other.

### Import multiple data files

Many functions in Python have a set of options that can be set by the user if needed. Let's tell pandas to assign empty values in our CSV to NaN with the parameters `keep_default_na=False` and `na_values=[""]`.
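
The effect of these two parameters can be seen in the sketch below, which reads the same made-up CSV text twice. The column names and values are hypothetical.

```python
import io

import pandas as pd

# Hypothetical CSV: one cell holds the text "NA", another is empty
csv_text = "movieId,title,tag\n1,Movie A,NA\n2,Movie B,\n"

# Default behaviour: both "NA" and the empty cell are read as NaN
default_df = pd.read_csv(io.StringIO(csv_text))

# Only empty cells become NaN; the text "NA" is kept as a string
strict_df = pd.read_csv(
    io.StringIO(csv_text),
    keep_default_na=False,
    na_values=[""],
)

print(default_df)
print(strict_df)
```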

### Identifying join keys

In our example, the join key is the `movieId` column.

### Types of joins

#### Inner join
- Returns a new DataFrame that contains only those rows that have matching values in both of the original DataFrames.
- `merged_inner = pd.merge(left=df1, right=df2, left_on='col1', right_on='col2')`

#### Left join
- Returns all of the rows from the left DataFrame, even those rows whose join key(s) do not have values in the right DataFrame.
- `merged_left = pd.merge(left=df1, right=df2, how='left', left_on='col1', right_on='col2')`

#### Right join
- Returns all of the rows from the right DataFrame, even those rows whose join key(s) do not have values in the left DataFrame.
- `merged_right = pd.merge(left=df1, right=df2, how='right', left_on='col1', right_on='col2')`

#### Full (outer) join
- Returns all rows from both DataFrames, matching them where the join key(s) agree and filling the remaining columns with NaN where there is no match.
- `merged_outer = pd.merge(left=df1, right=df2, how='outer', left_on='col1', right_on='col2')`
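
The four join types can be compared on two small made-up DataFrames that share a `movieId` column. The table names and values are hypothetical.

```python
import pandas as pd

# movieId 1 appears only in movies, movieId 4 only in ratings
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie C"],
})
ratings = pd.DataFrame({
    "movieId": [2, 3, 4],
    "rating": [4.0, 3.5, 5.0],
})

# Inner join (the default): only keys present in both DataFrames
merged_inner = pd.merge(left=movies, right=ratings,
                        left_on="movieId", right_on="movieId")

# Left join: every row of movies, NaN rating where there is no match
merged_left = pd.merge(left=movies, right=ratings, how="left",
                       left_on="movieId", right_on="movieId")

# Right join: every row of ratings, NaN title where there is no match
merged_right = pd.merge(left=movies, right=ratings, how="right",
                        left_on="movieId", right_on="movieId")

# Outer join: every row of both DataFrames
merged_outer = pd.merge(left=movies, right=ratings, how="outer",
                        left_on="movieId", right_on="movieId")

for name, df in [("inner", merged_inner), ("left", merged_left),
                 ("right", merged_right), ("outer", merged_outer)]:
    print(name, len(df))
```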

In [None]:
# Inner join

In [None]:
# Left join

In [None]:
# Right join

In [None]:
# Full (outer) join

------------------------------------------------------
# SLIDES
------------------------------------------------------

# Introduction to Plotting

Let's create a copy of our `merged_inner` data and make some plots with that.

In [None]:
# Plot data directly from a Pandas dataframe

In [None]:
# Matplotlib

In [None]:
# Seaborn

In [None]:
# Plotnine

These examples provide the same output visually but differ significantly in the way they are coded. The choice between them depends on the user's preference for customization, simplicity, and familiarity with the plotting paradigm.

#### Summary of Differences

**Matplotlib:**

- Requires more boilerplate code (e.g., `plt.figure()`, `plt.show()`).
- Customisation (color, labels) is done through method arguments.
- The histogram is created using `plt.hist()`.

**Seaborn:**

- Less code than Matplotlib, with some additional aesthetics by default.
- Histogram created with `sns.histplot()`; `kde=False` disables the kernel density estimate line.
- Integrates with Matplotlib for underlying plotting but adds simplicity.

**Plotnine:**

- Follows a declarative style with the *Grammar of Graphics* approach.
- Plots are constructed by layering components (`ggplot`, `aes`, `geom_histogram`).
- Requires fewer explicit function calls for titles and labels but uses a more complex syntax.

------------------------------------------------------
# SLIDES
------------------------------------------------------