# Introduction: Data Types and the Pandas DataFrame

$\textbf{by Ahmed Pirzada, University of Bristol}$

$\textbf{aj.pirzada@bristol.ac.uk}$

$\textbf{27th October 2025}$

## Learning Objectives

- Understand basic Python data types (string, int, list, dict) and printing with f-strings.
- Build a Pandas DataFrame from Python objects and manage indexes.
- Explore data with head/describe, select rows/columns, and compute summary stats.
- Filter rows with `DataFrame.query` using conditions and formulas.
- Group and aggregate data with `groupby`, `agg`, and `reset_index`.
- Create new columns using vectorised operations (means, row-wise max).
- Plot bar and scatter charts with labels and titles using Matplotlib.

## 1. Types of data

What you will do in this section:
- Create and inspect strings and integers using `type()`.
- Build a dictionary and a list, and access elements.
- Practice printing and basic string formatting with f-strings.

In [None]:
# Define a string variable for the student name.
name = 'Adam'

In [None]:
# Check the data type of the variable.
type(name)

str

In [None]:
# Print the value of the variable.
print(name)

Adam


In [None]:
# Use an f-string to format a message with variables.
print(f'Hello, {name}')

Hello, Adam


In [None]:
# Define an integer variable for age.
age = 20

In [None]:
# Check the data type (should be int).
type(age)

int

$\textbf{To-do:}$ Use an f-string to print: `{name} is {age} years old`

In [None]:
# Print: {name} is {age} years old
print(f'{name} is {age} years old')

Adam is 20 years old


---

In [None]:
# Create a dictionary mapping keys (name, age) to values.
dict_student = {'name':name, 'age':age}


In [None]:
# Print the dictionary.
print(dict_student)

{'name': 'Adam', 'age': 20}


$\textbf{To-do:}$ Create a dictionary for your group with information on name, gender, distance, happiness, wentout

In [None]:
# Create list of names of students in the group.


In [None]:
# List for gender - keep the same order as for names list.


In [None]:
# List for distance from your accommodation to university (in km).


In [None]:
# List for happiness level (0-10).


In [None]:
# List for number of times went out last week.


In [None]:
# Create a dictionary for your group with information on name, gender, distance, happiness, wentout
dict_students = {

    }

In [None]:
# Print the dictionary.


---


## 2. DataFrame: Python's version of an Excel spreadsheet

In this section you will:
- Import Pandas and create a DataFrame from your dictionary.
- Remove duplicate rows and drop rows with missing values.

In [None]:
# Import the Pandas library for tabular data.
import pandas as pd

$\textbf{To-do:}$ Convert Dictionary to DataFrame

In [None]:
# Create a DataFrame (table) from the dictionary.
df_students = pd.DataFrame(dict_students)

In [None]:
# Display the first few rows of the final DataFrame.


---

In [None]:
# Remove duplicate rows from the DataFrame.
df_students = df_students.drop_duplicates()

In [None]:
# Remove missing observations from the DataFrame.
df_students = df_students.dropna()

## 3. Learn about the dataset

Next steps:
- Preview data with `head()` and compute `describe()`.
- Select specific rows (e.g., `loc['Beth']`) and columns.
- Calculate means and check correlations.

In [None]:
# Preview the first few rows of the DataFrame.
df_students.head()

In [None]:
# Summary statistics for numeric columns.
df_students.describe()

In [None]:
# Set the name column as the row index for easy lookups.
df_students = df_students.set_index('name')
df_students.head()

$\textbf{To-do:}$ Learn about the dataset you created

In [None]:
# Select a row by label using .loc and the name.


In [None]:
# Select multiple columns from the DataFrame.


In [None]:
# Compute means of selected numeric columns.


In [None]:
# Compute correlation between numeric columns.


---

## 4. Query dataset

You will filter rows using:
- Simple conditions (e.g., math < 60).
- Combined conditions with `and`/`or`.
- Formula-based filters, e.g., average mark thresholds.
- Create a new DataFrame.

$\textbf{To-do:}$ Use 'query' method to learn about your dataset

In [None]:
# Filter rows where a numeric column (e.g. happiness) is below a threshold.


In [None]:
# Filter rows with multiple conditions using 'and'


In [None]:
# Filter rows with multiple conditions using 'or'


In [None]:
# Filter using a formula


In [None]:
# Create a new DataFrame with filtered results.
df_new = ...

In [None]:
# Display the first few rows of the new DataFrame.


---

## 5. By groups

Goal here:
- Use `groupby('gender')` to compute statistics.
- Select a single column before aggregating (e.g., `['econ']`).

$\textbf{To-do:}$ Use 'groupby' method to learn about your dataset

In [None]:
# Group by gender and compute mean for each group.


In [None]:
# Group by gender and compute mean for one variable only.


In [None]:
# Group by gender and compute multiple statistics for one variable only.


---

## 6. Creating new variables

What you will create:
- A mean mark using column arithmetic.
- A max mark using row-wise max(axis=1).

$\textbf{To-do:}$ Create new variables

In [None]:
# Add a new variable to the DataFrame as a function of existing variables.

In [None]:
# Check what the updated DataFrame looks like.


In [None]:
# Create a dummy variable for gender.


In [None]:
# Check what the updated DataFrame looks like.


---

## 7. Visualisation

Plot essentials:
- Bar chart to compare marks.
- Box plot to view distributions.
- Scatter plot for relationships; add titles and axis labels.

In [None]:
# Import Matplotlib for plotting.
import matplotlib.pyplot as plt

$\textbf{To-do:}$ Plot a bar chart and a scatter plot

In [None]:
# Select a column from the DataFrame and plot a bar chart.


In [None]:
# Scatter plot to inspect relationship between two of the variables.


---

# Student Notes: Code Explanations

Use this summary to understand what each part of the notebook is doing and why. Run a cell with Shift+Enter and read the output before moving on.

1) Basic data types
- Strings: text in quotes, e.g. `name = 'Adam'`. Use `print(name)` or f-strings like `print(f'Hello, {name}!')` to format text.
- Integers: whole numbers, e.g. `age = 20`. `type(x)` shows the data type.
- Dictionaries: key-value mapping, e.g. `{'name': 'Adam', 'age': 20}`. Access with `dict_student['age']`.
- Lists: ordered collections, e.g. `['Adam','Beth','Charlie']`. Lists become DataFrame columns later.
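
The four types above can be tried together in one cell. This is a minimal sketch that reuses the notebook's `name`, `age` and `dict_student`; the list of names is only an illustration.

```python
# Values taken from the notebook's first cells.
name = 'Adam'
age = 20

# An f-string substitutes the values of variables into text.
message = f'{name} is {age} years old'
print(message)

# A dictionary maps keys to values; look a value up by its key.
dict_student = {'name': name, 'age': age}
print(dict_student['age'])

# A list keeps items in order; positions start at 0.
names = ['Adam', 'Beth', 'Charlie']
print(names[0])

# type() reports the data type of each object.
print(type(name), type(age), type(dict_student), type(names))
```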

2) Building a DataFrame (Excel-like table)
- `import pandas as pd` loads the Pandas library used for data work.
- `pd.DataFrame(dict_students)` converts your dictionary of lists into a table with rows and columns.
- `pd.concat([df1, df2], ignore_index=True)` stacks tables and rebuilds a clean index (0, 1, 2, ...).
- `drop_duplicates().reset_index(drop=True)` removes repeated rows and tidies the index.
- `set_index('name')` makes the `name` column the row label to simplify lookups like `loc['Beth']`.
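
The steps in this part can be sketched as one short pipeline. The data below are made up for illustration (the marks, the repeated `Beth` row and the extra table `df_extra` are all assumptions, not part of the notebook's dataset).

```python
import pandas as pd

# Hypothetical data: names and marks are invented for this example.
dict_students = {
    'name': ['Adam', 'Beth', 'Charlie', 'Beth'],
    'gender': ['M', 'F', 'M', 'F'],
    'math': [55, 72, 64, 72],
    'econ': [61, 68, 58, 68],
}

# Each key becomes a column; each list position becomes a row.
df_students = pd.DataFrame(dict_students)

# The second 'Beth' row repeats the first exactly, so it is dropped.
df_students = df_students.drop_duplicates().reset_index(drop=True)

# Stack a second (hypothetical) table underneath and rebuild the index.
df_extra = pd.DataFrame(
    {'name': ['Dina'], 'gender': ['F'], 'math': [80], 'econ': [75]}
)
df_all = pd.concat([df_students, df_extra], ignore_index=True)

# Use the name column as the row label for lookups such as loc['Beth'].
df_all = df_all.set_index('name')
print(df_all)
```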

3) Exploring the data
- `head()` previews the first few rows.
- `describe()` gives summary statistics for numeric columns (mean, std, min/max, quartiles).
- `loc['Beth']` selects a row by its index label after `set_index('name')`.
- `df[['math','econ']]` selects multiple columns. `mean()` on numeric columns computes averages. `corr()` shows correlations.
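
A small sketch of these exploration calls, assuming a hypothetical table with `math` and `econ` marks indexed by `name` (the values are invented):

```python
import pandas as pd

# Hypothetical marks, indexed by student name.
df = pd.DataFrame(
    {
        'name': ['Adam', 'Beth', 'Charlie'],
        'math': [55, 72, 64],
        'econ': [61, 68, 58],
    }
).set_index('name')

print(df.head())      # first rows of the table
print(df.describe())  # count, mean, std, min, quartiles, max

# One row, selected by its index label.
row_beth = df.loc['Beth']

# Several columns at once: note the double square brackets.
marks = df[['math', 'econ']]

# Column averages and the correlation matrix.
means = marks.mean()
corr = marks.corr()
print(means)
print(corr)
```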

4) Filtering with query
- `query('math < 60')` keeps rows where the condition is true.
- Combine conditions: `and` (both must be true) or `or` (either can be true).
- You can filter using formulas, e.g. `query('(econ + math)/2 >= 60')` for average mark ≥ 60.
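
The filters above might look like this on a small table. The marks are hypothetical, and the threshold of 60 is just the one used in the examples in these notes.

```python
import pandas as pd

# Hypothetical marks, indexed by student name.
df = pd.DataFrame(
    {
        'name': ['Adam', 'Beth', 'Charlie'],
        'gender': ['M', 'F', 'M'],
        'math': [55, 72, 64],
        'econ': [61, 68, 58],
    }
).set_index('name')

# Simple condition: math mark below 60.
low_math = df.query('math < 60')

# Both conditions must hold.
both = df.query('math >= 60 and econ >= 60')

# At least one condition must hold.
either = df.query('math < 60 or econ < 60')

# Formula-based filter: average of the two marks is at least 60.
avg_ok = df.query('(econ + math)/2 >= 60')

# Store a filtered result as its own DataFrame.
df_new = avg_ok.copy()
print(df_new.head())
```

Each call to `query` returns a new DataFrame and leaves `df` unchanged, so you can try several filters on the same table.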

5) Grouping
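- `groupby('gender').mean()` computes the mean of each numeric column for each gender. If the DataFrame still contains other text columns, use `mean(numeric_only=True)`.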
- Select a single column before aggregating: `groupby('gender')['econ'].mean()`.
- Multiple stats: `agg(['mean','min','max'])`. Use `reset_index()` to turn the group labels back into a normal column.
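
A minimal sketch of the three grouping patterns, assuming a hypothetical table in which `name` is already the index so that only numeric columns remain to be averaged:

```python
import pandas as pd

# Hypothetical marks, indexed by student name.
df = pd.DataFrame(
    {
        'name': ['Adam', 'Beth', 'Charlie', 'Dina'],
        'gender': ['M', 'F', 'M', 'F'],
        'math': [55, 72, 64, 80],
        'econ': [61, 68, 58, 75],
    }
).set_index('name')

# Mean of every numeric column, one row per gender.
group_means = df.groupby('gender').mean()

# Mean of a single column, one value per gender.
econ_mean = df.groupby('gender')['econ'].mean()

# Several statistics for one column; reset_index() turns the
# group labels back into an ordinary 'gender' column.
econ_stats = (
    df.groupby('gender')['econ']
    .agg(['mean', 'min', 'max'])
    .reset_index()
)

print(group_means)
print(econ_mean)
print(econ_stats)
```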

6) Creating new variables
- Vectorised operations create columns using column arithmetic, e.g. `df['mean mark'] = (df['math'] + df['econ'])/2`.
- `max(axis=1)` finds the row-wise maximum across selected columns to create `max mark`.
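
As a sketch, the new columns could be built like this. The marks are hypothetical, and the column name `female` for the gender dummy is an assumption chosen for this example.

```python
import pandas as pd

# Hypothetical marks, indexed by student name.
df = pd.DataFrame(
    {
        'name': ['Adam', 'Beth', 'Charlie'],
        'gender': ['M', 'F', 'M'],
        'math': [55, 72, 64],
        'econ': [61, 68, 58],
    }
).set_index('name')

# Column arithmetic is applied to every row at once.
df['mean mark'] = (df['math'] + df['econ']) / 2

# axis=1 takes the maximum across columns, separately for each row.
df['max mark'] = df[['math', 'econ']].max(axis=1)

# Dummy variable: 1 if gender is 'F', otherwise 0.
# The column name 'female' is an illustrative choice.
df['female'] = (df['gender'] == 'F').astype(int)

print(df.head())
```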

7) Visualisation
- Bar: compare values across students or subjects.
- Box: view distribution (median, quartiles, potential outliers).
- Scatter: examine relationship between two variables (e.g., math vs econ). Add titles and axis labels for clarity.
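
One possible way to draw the three chart types side by side with Matplotlib is sketched below. The marks are hypothetical, and the titles and axis labels are illustrative choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical marks, indexed by student name.
df = pd.DataFrame(
    {
        'name': ['Adam', 'Beth', 'Charlie', 'Dina'],
        'math': [55, 72, 64, 80],
        'econ': [61, 68, 58, 75],
    }
).set_index('name')

# One figure with three panels in a row.
fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Bar chart: one bar per student.
axes[0].bar(df.index, df['math'])
axes[0].set_title('Math mark by student')
axes[0].set_xlabel('Student')
axes[0].set_ylabel('Math mark')

# Box plot: one box per column of the array (math, then econ).
axes[1].boxplot(df[['math', 'econ']].to_numpy())
axes[1].set_xticklabels(['math', 'econ'])
axes[1].set_title('Distribution of marks')
axes[1].set_ylabel('Mark')

# Scatter plot: one point per student.
axes[2].scatter(df['math'], df['econ'])
axes[2].set_title('Math vs econ')
axes[2].set_xlabel('Math mark')
axes[2].set_ylabel('Econ mark')

fig.tight_layout()
plt.show()
```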

Tips
- Read outputs after each step to build intuition.
- If something errors, re-run the cell that creates the object (e.g. the one that defines `df_students`).
- Use `type(obj)` and `obj.shape` (for DataFrames) to quickly check what you have.
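
For instance, a quick check on a small hypothetical DataFrame might look like this:

```python
import pandas as pd

# Hypothetical table with three rows and two columns.
df_students = pd.DataFrame(
    {'name': ['Adam', 'Beth', 'Charlie'], 'math': [55, 72, 64]}
)

# type() confirms the object is a DataFrame.
print(type(df_students))

# shape is (number of rows, number of columns).
print(df_students.shape)
```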