# Introduction: Data Types and the Pandas DataFrame

$\textbf{by Ahmed Pirzada, University of Bristol}$

$\textbf{aj.pirzada@bristol.ac.uk}$

$\textbf{27th October 2025}$

## Learning Objectives

- Understand Python's basic data types (string, int, list, dict) and printing with f-strings.
- Build a Pandas DataFrame from Python objects and manage indexes.
- Explore data with head/describe, select rows/columns, and compute summary stats.
- Filter rows with `DataFrame.query` using conditions and formulas.
- Group and aggregate data with `groupby`, `agg`, and `reset_index`.
- Create new columns using vectorised operations (means, row-wise max).
- Plot bar, box, and scatter charts with labels and titles using Matplotlib.

## 1. Types of data

What you will do in this section:
- Create and inspect strings and integers using `type()`.
- Build a dictionary and a list, and access elements.
- Practice printing and basic string formatting with f-strings.

In [None]:
# Define a string variable for the student name.
name = 'Adam'

In [None]:
# Check the data type of the variable.
type(name)

In [None]:
# Print the value of the variable.
print(name)

In [None]:
# Use an f-string to format a message with variables.
print(f'Hello, {name}!')

In [None]:
# Define an integer variable for age.
age = 20

In [None]:
# Check the data type (should be int).
type(age)

In [None]:
print(f'{name} is {age} years old.')
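F-strings can also evaluate expressions and format numbers inside the braces. A small sketch, where `mark` is a hypothetical value that is not used elsewhere in this notebook:

```python
name = 'Adam'
age = 20
mark = 61.456  # hypothetical value, for illustration only

# Expressions inside the braces are evaluated before the text is built.
next_year = f'{name} will be {age + 1} next year.'

# ':.1f' formats the number with one decimal place.
rounded = f'{name} scored {mark:.1f}.'

print(next_year)
print(rounded)
```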

In [None]:
# Create a dictionary mapping keys (name, age) to values.
dict_student = {'name': name, 'age': age}


In [None]:
print(dict_student)

In [None]:
# Redefine name as a list of student names.
name = ['Adam', 'Beth', 'Charlie']

In [None]:
type(name)

In [None]:
# Define lists for age, gender, and the Math and Econ marks.
age = [20, 21, 19]
gender = ['Male', 'Female', 'Male']
math = [60, 75, 54]
econ = [60, 68, 65]

In [None]:
# Combine the lists into a single dictionary of columns.
dict_students = {
    'name': name,
    'age': age,
    'gender': gender,
    'math': math,
    'econ': econ
    }


In [None]:
print(dict_students)
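Accessing elements is one of the goals of this section. A minimal sketch, using separate variable names (`student`, `names`) so that the objects defined above are not overwritten:

```python
student = {'name': 'Adam', 'age': 20}
names = ['Adam', 'Beth', 'Charlie']

# Dictionary values are looked up by key.
age_of_student = student['age']

# List elements are looked up by position, starting from 0.
first_name = names[0]
# Negative positions count from the end of the list.
last_name = names[-1]

print(f'{first_name} is {age_of_student} years old.')
print(last_name)
```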

## 2. DataFrame: Python version of Excel spreadsheet

In this section you will:
- Import Pandas and create a DataFrame from your dictionary.
- Append rows with `concat`, remove duplicates, and tidy the index.
- Set 'name' as the index to enable label-based selection.

In [None]:
# Import the Pandas library for tabular data.
import pandas as pd


In [None]:
# Create a DataFrame (table) from the dictionary.
df_students = pd.DataFrame(dict_students)


In [None]:
df_students.head()

In [None]:
# Create another dictionary with duplicate and missing observations.
add_dict = {'name': ['Eva', 'Adam'], 'age': [20, 20], 'gender': ['Female', 'Male'], 'math': [None, 60], 'econ': [75, 60]}

# Create a DataFrame from the new dictionary.
add_df = pd.DataFrame(add_dict)

# Append the new DataFrame to the original DataFrame.
df_students = pd.concat([df_students, add_df], ignore_index=True)

# Display the first 6 rows of the updated DataFrame.
df_students.head(6) # Note: Adam appears twice, and Eva has a missing math score.

In [None]:
# Remove duplicate rows and rebuild the index.
df_students = df_students.drop_duplicates().reset_index(drop=True)
df_students.head()

In [None]:
# Remove rows that contain missing values.
df_students = df_students.dropna()
df_students.head()
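Before dropping rows, it can help to count how many duplicates and missing values there are. A minimal sketch on a small stand-in table (`demo` is a hypothetical name, with values modelled on the data above):

```python
import pandas as pd

# Stand-in table: Adam appears twice and Eva has no math mark.
demo = pd.DataFrame({
    'name': ['Adam', 'Beth', 'Eva', 'Adam'],
    'math': [60, 75, None, 60],
    'econ': [60, 68, 75, 60],
})

# Count rows that repeat an earlier row exactly.
n_duplicates = demo.duplicated().sum()

# Count missing values in the math column.
n_missing = demo['math'].isna().sum()

# Drop duplicates, then rows with missing values, then renumber the index.
cleaned = demo.drop_duplicates().dropna().reset_index(drop=True)

print(n_duplicates, n_missing)
print(cleaned)
```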

## 3. Learn about the dataset

Next steps:
- Preview data with `head()` and compute `describe()`.
- Select specific rows (e.g., loc['Beth']) and columns.
- Calculate means and check correlations between marks.

In [None]:
# Preview the first few rows of the DataFrame.
df_students.head() 

In [None]:
# Summary statistics for numeric columns.
df_students.describe()

In [None]:
# Set the name column as the row index for easy lookups.
df_students = df_students.set_index('name')


In [None]:
df_students.head()

In [None]:
# Select a row by label using .loc and the name index.
df_students.loc['Beth']

In [None]:
# Select multiple columns from the DataFrame.
df_students[['math', 'econ']]

In [None]:
# Compute the mean age and the mean mark in each subject.
df_students[['age','math','econ']].mean()

In [None]:
# Correlation between Math and Econ marks.
df_students[['math','econ']].corr()
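`.loc` also accepts a column label after the row label, which picks out a single cell or a smaller table. A minimal sketch on a stand-in table (`marks` is a hypothetical name):

```python
import pandas as pd

# Stand-in table indexed by student name.
marks = pd.DataFrame(
    {'math': [60, 75, 54], 'econ': [60, 68, 65]},
    index=['Adam', 'Beth', 'Charlie'],
)

# Row label first, column label second: a single value.
beth_math = marks.loc['Beth', 'math']

# Lists of labels return a smaller DataFrame.
subset = marks.loc[['Adam', 'Charlie'], ['econ']]

print(beth_math)
print(subset)
```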

## 4. Query dataset

You will filter rows using:
- Simple conditions (e.g., math < 60).
- Combined conditions with `and`/`or`.
- Formula-based filters, e.g., average mark thresholds.

In [None]:
# Filter rows where Math is below a threshold.
df_students.query('math < 60')

In [None]:
# Keep rows where both conditions hold (and).
df_students.query('econ >= 60 and math >= 60')

In [None]:
# Keep rows where at least one condition holds (or).
df_students.query('econ >= 60 or math >= 60')

In [None]:
# Filter using a formula (average mark threshold).
df_students.query('(econ + math)/2 >= 60')

In [None]:
# Keep only male students and store them in a new DataFrame.
df_male = df_students.query('gender == "Male"')

In [None]:
df_male.head()
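A `query` string is shorthand for a boolean mask. The sketch below, on a stand-in table (`marks` is a hypothetical name), shows that both forms should select the same rows:

```python
import pandas as pd

marks = pd.DataFrame(
    {'math': [60, 75, 54], 'econ': [60, 68, 65]},
    index=['Adam', 'Beth', 'Charlie'],
)

# Filter with a query string.
by_query = marks.query('(econ + math)/2 >= 60')

# The same filter written as a boolean mask.
mask = (marks['econ'] + marks['math']) / 2 >= 60
by_mask = marks[mask]

print(by_query)
print(by_query.equals(by_mask))
```

The mask form is more verbose, but it makes explicit that each row is tested and kept only when the condition is true.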

## 5. By groups

Goal here:
- Use groupby('gender') to compute statistics.
- Select a single column before aggregating (e.g., ['econ']).
- Return a tidy table with `reset_index()`.

In [None]:
# Group by gender and compute mean for each group.
df_students.groupby('gender').mean()

In [None]:
# Group by gender and compute mean Econ only.
df_students.groupby('gender')['econ'].mean()

In [None]:
# Compute several Econ statistics by gender, then turn the group labels back into a normal column.
df_students.groupby('gender')['econ'].agg(['mean', 'min', 'max']).reset_index()
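To see what `reset_index()` changes, compare the grouped table before and after. A minimal sketch on a stand-in table (`demo` is a hypothetical name):

```python
import pandas as pd

demo = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male'],
    'econ': [60, 68, 65],
})

# After groupby, the group labels form the row index.
summary = demo.groupby('gender')['econ'].agg(['mean', 'min', 'max'])

# reset_index() moves the labels into an ordinary 'gender' column.
tidy = summary.reset_index()

print(summary)
print(tidy)
```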

## 6. Creating new variables

What you will create:
- A mean mark using column arithmetic.
- A max mark using row-wise max(axis=1).

In [None]:
# Create a new column with the average of Math and Econ.
df_students['mean mark'] = (df_students['math'] + df_students['econ']) / 2

In [None]:
df_students.head()

In [None]:
# Create a 0/1 indicator column that equals 1 for female students.
df_students['is_female'] = (df_students['gender'] == "Female").astype(int)

In [None]:
df_students.head()
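The row-wise maximum listed at the start of this section can be sketched in the same way. The example below uses a stand-in table (`marks` is a hypothetical name) and stores the higher of the two marks in a new `max mark` column:

```python
import pandas as pd

marks = pd.DataFrame(
    {'math': [60, 75, 54], 'econ': [60, 68, 65]},
    index=['Adam', 'Beth', 'Charlie'],
)

# axis=1 takes the maximum across columns, giving one value per row.
marks['max mark'] = marks[['math', 'econ']].max(axis=1)

print(marks)
```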

## 7. Visualisation

Plot essentials:
- Bar chart to compare marks.
- Box plot to view distributions.
- Scatter plot for relationships; add titles and axis labels.

In [None]:
# Import Matplotlib for plotting.
import matplotlib.pyplot as plt

In [None]:
# Bar chart comparing Math and Econ marks for each student.
df_students[['math', 'econ']].plot(kind='bar')
plt.title('Student Marks Comparison')
plt.ylabel('Marks')
plt.show()


In [None]:
# Scatter plot to inspect relationship between Math and Econ.
df_students.plot(kind='scatter', x='math', y='econ')
plt.title('Math vs Econ Marks')
plt.xlabel('Math')
plt.ylabel('Econ')
plt.show()
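A box plot follows the same pattern as the bar and scatter charts. A minimal sketch on a stand-in table (`marks` is a hypothetical name; the title text is also illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt

marks = pd.DataFrame(
    {'math': [60, 75, 54], 'econ': [60, 68, 65]},
    index=['Adam', 'Beth', 'Charlie'],
)

# One box per column, showing the median and quartiles of each subject.
ax = marks[['math', 'econ']].plot(kind='box')
plt.title('Distribution of Marks')
plt.ylabel('Marks')
plt.show()
```

With only three students the boxes are not very informative; the plot becomes more useful as the number of rows grows.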

# Student Notes: Code Explanations

Use this summary to understand what each part of the notebook is doing and why. Run a cell with Shift+Enter and read the output before moving on.

1) Basic data types
- Strings: text in quotes, e.g. `name = 'Adam'`. Use `print(name)` or f-strings like `print(f'Hello, {name}!')` to format text.
- Integers: whole numbers, e.g. `age = 20`. `type(x)` shows the data type.
- Dictionaries: key-value mapping, e.g. `{'name': 'Adam', 'age': 20}`. Access with `dict_student['age']`.
- Lists: ordered collections, e.g. `['Adam','Beth','Charlie']`. Lists become DataFrame columns later.

2) Building a DataFrame (Excel-like table)
- `import pandas as pd` loads the Pandas library used for data work.
- `pd.DataFrame(dict_students)` converts your dictionary of lists into a table with rows and columns.
- `pd.concat([df1, df2], ignore_index=True)` stacks tables and rebuilds a clean index (0, 1, 2, ...).
- `drop_duplicates().reset_index(drop=True)` removes repeated rows and tidies the index.
- `set_index('name')` makes the `name` column the row label to simplify lookups like `loc['Beth']`.

3) Exploring the data
- `head()` previews the first few rows.
- `describe()` gives summary statistics for numeric columns (mean, std, min/max, quartiles).
- `loc['Beth']` selects a row by its index label after `set_index('name')`.
- `df[['math','econ']]` selects multiple columns. `mean()` on numeric columns computes averages. `corr()` shows correlations.

4) Filtering with query
- `query('math < 60')` keeps rows where the condition is true.
- Combine conditions: `and` (both must be true) or `or` (either can be true).
- You can filter using formulas, e.g. `query('(econ + math)/2 >= 60')` for average mark ≥ 60.

5) Grouping
- `groupby('gender').mean()` computes the mean for each gender.
- Select a single column before aggregating: `groupby('gender')['econ'].mean()`.
- Multiple stats: `agg(['mean','min','max'])`. Use `reset_index()` to turn the group labels back into a normal column.

6) Creating new variables
- Vectorised operations create columns using column arithmetic, e.g. `df['mean mark'] = (df['math'] + df['econ'])/2`.
- `max(axis=1)` finds the row-wise maximum across selected columns to create `max mark`.

7) Visualisation
- Bar: compare values across students or subjects.
- Box: view distribution (median, quartiles, potential outliers).
- Scatter: examine relationship between two variables (e.g., math vs econ). Add titles and axis labels for clarity.

Tips
- Read outputs after each step to build intuition.
- If something errors, re-run the cell that creates the object (e.g. the one that defines `df_students`).
- Use `type(obj)` and `obj.shape` (for DataFrames) to quickly check what you have.
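
The last tip can be tried on any table. A minimal sketch, using a stand-in DataFrame (`table` is a hypothetical name):

```python
import pandas as pd

table = pd.DataFrame({
    'name': ['Adam', 'Beth', 'Charlie'],
    'math': [60, 75, 54],
})

# type() reports what kind of object this is.
print(type(table))

# shape is a (rows, columns) pair.
n_rows, n_cols = table.shape
print(n_rows, n_cols)
```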