<a href="https://colab.research.google.com/github/TamaraDelToro/Coding_Community_Resources/blob/main/Introduction_to_python_for_data_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction to Python for data analysis**

### 👋 **Welcome to our first session on using Python programming for data analysis.**

This notebook is designed to introduce you to Python programming in a hands-on, beginner-friendly way. Whether you've never written a line of code before or are just getting started, this session will guide you through the basics of Python, with a focus on how it can be used to explore and analyze biomedical data.

You’ll learn how to:

*  Write simple Python code 🐍

*  Work with real datasets using pandas 🐼

*  Visualize your data using matplotlib 📊

*  Build confidence to continue learning on your own 😎


You don’t need to be a coder to follow along — this session is all about helping you build confidence with the essentials. We'll work through examples together and leave you with a notebook you can revisit anytime 🧡



## **Why Python? (vs. R)**

Python and R are both powerful tools for data analysis, and in truth, both are great depending on your needs. Here's why we're using Python for this session:

*  🔨 **Versatility**: Python is a general-purpose language used in web development, automation, machine learning, and scientific computing. Learning Python opens a lot of doors beyond data analysis.
*  🌎 **Wider ecosystem**: Python integrates well with other tools (e.g., databases, APIs, machine learning models), which is helpful in biomedical research pipelines.
*  📚 **Readable & beginner friendly**: Python has a clean, intuitive syntax that many find easier to learn than R’s.
*  🤝 **Growing adoption in biomedicine**: Increasingly, biomedical researchers and clinicians are adopting Python for data analysis, especially in fields like genomics, epidemiology, and digital health.

That said, R is excellent, particularly for statistical modeling and data visualization. If you already know R — great! If not, Python is a strong place to begin your coding journey.

## 🐍 **1. Python Basics: Just Enough to Get Started**

Before we work with data, let’s get familiar with a few foundations of Python.

### 🔤 **Variables**

A variable stores information so you can use it later.

```python
name = "Alice"
age = 30
```

Now `name` is a variable that contains the text "Alice" and `age` contains the number 30.

---

### 🔢 **Data Types**

Different kinds of information are called data types. Here are some common ones:
```python
text = "hello"       # a string (text)
number = 5           # an integer
height = 1.70        # a float (decimal number)
is_smoker = False    # a boolean (True or False)
```
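
Values can often be converted from one type to another, which comes up a lot when numbers arrive as text. Here is a minimal sketch; the variable names are made up for illustration:

```python
# Numbers that arrive as text are strings until you convert them
age_text = "30"
age_number = int(age_text)       # string -> integer
height_number = float("1.70")    # string -> float
age_label = str(age_number)      # integer -> string

print(type(age_text))      # <class 'str'>
print(type(age_number))    # <class 'int'>
print(age_number + 5)      # 35
```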

---
### 📚 **Lists**

A list holds multiple things in order — like a collection.

`ages = [25, 30, 35, 40]`

You can access items by their position (in Python, counting starts from 0):

`print(ages[0])  # prints 25`
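
A few more things you can do with the same list, shown as a short sketch:

```python
ages = [25, 30, 35, 40]

print(ages[1])      # second item: 30
print(ages[-1])     # negative positions count from the end: 40
print(ages[0:2])    # a slice with the first two items: [25, 30]

ages.append(45)     # add a new item to the end of the list
print(len(ages))    # the list now has 5 items
```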

---
### 🗂️ **Dictionaries**

A dictionary stores data in pairs: keys and values.

```python
patient = {
    "name": "Alice",
    "age": 30,
    "smoker": False
}
print(patient["name"])  # prints "Alice"
```
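
Dictionaries can also be changed after you create them. The sketch below reuses the `patient` example; the `height` and `weight` keys are added purely for illustration:

```python
patient = {"name": "Alice", "age": 30, "smoker": False}

patient["age"] = 31             # update an existing value
patient["height"] = 1.70        # add a new key-value pair

print(patient.keys())           # all the keys
print(patient.get("weight"))    # a missing key gives None instead of an error
```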

---
### 🧰 **Functions and Methods**

A function is like a little machine that does something for you.
```python
print("hello")      # prints text to screen
len(ages)           # gives the length of a list
```
Some things, like strings and lists, have methods (functions that belong to them):

`name.upper()        # makes the text uppercase`
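
To see the difference side by side, here is a small sketch using the `name` and `ages` examples from earlier:

```python
name = "Alice"
ages = [25, 30, 35, 40]

# Functions: the thing you work on goes inside the brackets
print(len(name))      # 5 (number of characters)
print(max(ages))      # 40

# Methods: written after a dot, attached to the thing itself
print(name.upper())   # ALICE
print(name.lower())   # alice
ages.sort(reverse=True)   # reorders the list in place, largest first
print(ages)           # [40, 35, 30, 25]
```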

---
### 💬 **Comments**

Comments help you explain your code — Python ignores them when running.

`# This is a comment`

### 🧠 **Practice Time: Python Basics**

Try solving the following exercises on your own. Don’t worry if you don’t get them right the first time—learning to code is all about trying things out!

1️⃣ **Create and print a variable**

Create a variable called `my_name` and assign your name to it.

Print a sentence that says: `Hello, my name is [your name]`

💡 *Hint - you might want to use the `print()` function!*

In [1]:
# Exercise 1: Create and print a variable
my_name =
print('Hello, my name is', ______)

Hello, my name is Tamara


2️⃣ **What type is this?**

Create the following variables:

```python
age = 30
height = 1.75
is_student = True
```
Use the `type()` function to print the type of each variable.

In [None]:
# Exercise 2: What type is this?


3️⃣ **Importing and using a library**

Import the math library.

Use it to calculate the square root of 144

💡 *Hint: a square root can be calculated with `math.sqrt()`.*

In [None]:
# Exercise 3: importing and using a library


4️⃣ **Mini Challenge: Combining text and numbers**

Create two variables: `temperature = 36.6` and `unit = "Celsius"`

Print: `Your body temperature is 36.6 Celsius`


In [2]:
# Mini challenge: combining text and numbers
# Create your variables


# Print the sentence
print("Your body temperature is", _______, _______)



Your body temperature is 36.6 Celsius


## 📦 **2. Libraries**

Before we dive into data, let's load the main tools we'll use:

| 📦 Library | 🔍 What It’s For |
|------------|------------------|
| `pandas`   | For working with **tables of data** (like Excel but more powerful). |
| `numpy`    | For doing **maths and calculations**, especially on large datasets. |
| `matplotlib.pyplot` | For **making simple graphs and charts** to visualise your data. |

These are **libraries**. Libraries **contain** useful code that allows you to carry out particular tasks, beyond what base Python allows.

When using a function in Python that belongs to a library, you have to name the library first:

`library.function()`

**We can give libraries nicknames** when we import them, so that we don't have to write out their full name every time. While you can name them whatever you want, **there are some commonly used nicknames that will make your code more readable to others!**
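
As a quick sketch of both styles, the example below uses two libraries that come built into Python. The nickname `st` is chosen here just for illustration and is not a standard convention:

```python
import math                # imported under its full name
import statistics as st    # imported with a nickname

print(math.floor(3.7))          # library.function() -> 3
print(st.mean([25, 30, 35]))    # nickname.function() -> 30
```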

Write these into the code cell below and run to load your libraries:
*  `import pandas as pd`
*  `import matplotlib.pyplot as plt`
*  `import numpy as np`

✅ If you're using Google Colab, these libraries are already installed.

💻 If you're working locally, you can install them by running this in a notebook cell (leave out the `!` if you are typing it into a terminal):

`!pip install pandas matplotlib numpy`

In [19]:
# Load your libraries here
import _____ as ___

## 💻 **3. Intro to Data Analysis with Pandas**
### 🐼 **What is Pandas?**

Pandas is a powerful Python library used to work with structured data. It allows you to load, explore, manipulate, and analyze data with ease—kind of like working with spreadsheets or SQL tables, but in Python.

We mainly work with two data types:

<table>
  <tr>
    <th>Type</th>
    <th>Description</th>
    <th>Analogy</th>
  </tr>
  <tr>
    <td><code>Series</code></td>
    <td>A one-dimensional labelled array</td>
    <td>A single column</td>
  </tr>
  <tr>
    <td><code>DataFrame</code></td>
    <td>A two-dimensional table of rows and columns</td>
    <td>Like an Excel sheet or SQL table</td>
  </tr>
</table>
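
To make the two types concrete, here is a minimal sketch that builds one of each by hand. The numbers are made up for illustration:

```python
import pandas as pd

# A Series: one labelled column of values
glucose = pd.Series([148, 85, 183], name="Glucose")

# A DataFrame: several columns side by side, built here from a dictionary
patients = pd.DataFrame({
    "Glucose": [148, 85, 183],
    "BMI": [33.6, 26.6, 23.3],
})

print(type(glucose))            # a Series
print(type(patients))           # a DataFrame
print(type(patients["BMI"]))    # one column of a DataFrame is a Series
print(patients.shape)           # (rows, columns)
```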

---
### 📥 **Loading Data**
Paste the following into a code cell:
```python
# Load the Pima Indians Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Diabetes']
df = pd.read_csv(url, names=columns)

# Display the first few rows
df.head()
```
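
The `names=` argument is what you reach for when a CSV file has no header row, which appears to be the case for this dataset. The sketch below shows the idea on two made-up rows held in memory, so it runs without downloading anything; `raw_text` and `toy` are illustrative names:

```python
import io
import pandas as pd

# Two made-up rows in the style of a CSV file with no header row
raw_text = "6,148,33.6\n1,85,26.6\n"

# Without names=, pandas would treat the first row as the column names
toy = pd.read_csv(io.StringIO(raw_text), names=["Pregnancies", "Glucose", "BMI"])

print(toy)
print(len(toy))   # both rows are kept as data
```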

In [12]:
# Copy the code here


In [None]:
# Try df.head(2) and df.head(20) - what happens?


### 🔍 **Exploring the Dataset**
👉 Try running the following code:
```python
# Overview of the dataset
df.info()

# Summary statistics
df.describe()

# Column names
df.columns

# Check for missing values
df.isnull().sum()
```
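
One thing `isnull()` will not catch: in this dataset, zeros in columns such as `Glucose`, `BloodPressure`, and `BMI` are widely treated as stand-ins for missing measurements. The sketch below shows one possible way to count them and mark them as missing, using a small made-up table rather than the real data:

```python
import numpy as np
import pandas as pd

# Made-up values; 0 stands in for "not measured"
toy = pd.DataFrame({
    "Glucose": [148, 0, 183],
    "BMI": [33.6, 26.6, 0.0],
})

print(toy.isnull().sum())    # reports no missing values
print((toy == 0).sum())      # counts the zeros in each column

# Optionally swap the zeros for NaN so pandas treats them as missing
cleaned = toy.replace(0, np.nan)
print(cleaned.isnull().sum())
```

Whether a zero really means "missing" depends on the column (zero pregnancies is a valid value, for example), so treat this as something to check rather than a rule.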

In [None]:
# Try it out! - do you know what each function is doing?


### 📊 **Selecting and Filtering Data**
👉 Code cell:
```python
# Select a single column (e.g., glucose levels)
df['Glucose']

# Select multiple columns
df[['Glucose', 'Insulin']]

# Filter: patients with high BMI (> 30)
df[df['BMI'] > 30]

# Filter: patients with diabetes
df[df['Diabetes'] == 1]
```
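
Filters can also be combined. Here is a minimal sketch on a made-up table, using `&` for "and" and `|` for "or". Each condition needs its own round brackets:

```python
import pandas as pd

# Made-up patients for illustration
toy = pd.DataFrame({
    "BMI": [33.6, 26.6, 23.3, 43.1],
    "Age": [50, 31, 32, 33],
    "Diabetes": [1, 0, 1, 1],
})

# AND: both conditions must be true
high_bmi_diabetic = toy[(toy["BMI"] > 30) & (toy["Diabetes"] == 1)]

# OR: at least one condition must be true
older_or_high_bmi = toy[(toy["Age"] > 40) | (toy["BMI"] > 30)]

print(high_bmi_diabetic)
print(older_or_high_bmi)
```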



In [None]:
# Give these a go here



### 🧪 **Guided Exercise**

🔬 Try this!

Let’s explore the dataset using what we’ve learned so far. Some code is provided to help you get started:

✅ Print the number of patients in the dataset

✅ Find the average glucose level

✅ Count how many patients have diabetes

✅ Find the average BMI of patients who do not have diabetes

👉 Code cell:
```python
# 1. Number of patients in the dataset
print("Total patients:", len(df))

# 2. Average glucose level
print("Average glucose level:", df['Glucose'].mean())

# 3. Number of patients with diabetes
print("Patients with diabetes:", df[df['Diabetes'] == 1].shape[0])

# 4. Average BMI of patients without diabetes
print("Average BMI (non-diabetic patients):", df[df['Diabetes'] == 0]['BMI'].mean())
```
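
There is often more than one way to get the same answer in pandas. As an alternative sketch on a made-up table, `value_counts()` counts the rows in each group and `groupby()` calculates a statistic for every group in one step:

```python
import pandas as pd

# Made-up patients for illustration
toy = pd.DataFrame({
    "BMI": [33.6, 26.6, 23.3, 43.1],
    "Diabetes": [1, 0, 1, 0],
})

# Count how many rows fall into each group
group_counts = toy["Diabetes"].value_counts()
print(group_counts)

# Average BMI for each group in a single step
mean_bmi = toy.groupby("Diabetes")["BMI"].mean()
print(mean_bmi)
```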

❔Can you figure out what each function does?

In [None]:
# Explore the functions here and figure out what each is doing


## 📊 **4. Visualising Data in Python**
Why Visualise?

Data visualisation helps us:

🧠 Understand trends and patterns

🔍 Spot outliers or missing data

🗣 Communicate findings clearly

---
### 🧰 **Libraries for Visualisation**
| Library     | Purpose                                      |
|-------------|----------------------------------------------|
| `matplotlib`| The foundational Python plotting library     |
| `seaborn`   | Built on top of matplotlib, prettier & easier |

We’ll start with matplotlib, then briefly show how seaborn improves the look.

---
### 🔨 **Basic Plot with matplotlib**
Let’s make a histogram of BMI values to see how body mass index is distributed in our dataset.
```python
# Histogram of BMI
plt.hist(df['BMI'], bins=20, color='skyblue', edgecolor='black')
plt.title("BMI Distribution")
plt.xlabel("BMI")
plt.ylabel("Number of Patients")
plt.show()
```
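
The `bins` value controls how many bars the histogram is split into, and it can change how the same data looks. The sketch below draws two histograms side by side from a handful of made-up BMI values so you can compare them:

```python
import matplotlib.pyplot as plt

# Made-up BMI values for illustration
bmi_values = [22.5, 24.1, 26.6, 27.3, 28.9, 30.2, 31.5, 33.6, 35.0, 43.1]

# One row with two plots: few bins on the left, many bins on the right
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(bmi_values, bins=3, color="skyblue", edgecolor="black")
axes[0].set_title("3 bins")
axes[1].hist(bmi_values, bins=10, color="skyblue", edgecolor="black")
axes[1].set_title("10 bins")

# Label both plots the same way
for ax in axes:
    ax.set_xlabel("BMI")
    ax.set_ylabel("Number of Patients")

plt.tight_layout()
plt.show()
```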

In [None]:
# Try it yourself!



Now, try changing the column in the code below to visualise a different variable (like `Glucose`, `Age`, or `BloodPressure`). Column names are case-sensitive, so type them exactly as they appear in `df.columns`.



In [None]:
# Modify the column here!
plt.hist(df['Glucose'], bins=20, color='salmon', edgecolor='black')
plt.title("Glucose Distribution")
plt.xlabel("Glucose Level")
plt.ylabel("Number of Patients")
plt.show()

### 🎨 **Enter seaborn: Better Visuals, Less Code**

Here’s how to make a similar plot using seaborn. Notice how it’s simpler and more aesthetic.

In [None]:
import seaborn as sns

# Seaborn histogram
sns.histplot(df['BMI'], kde=True, color='purple')
plt.title("BMI Distribution with KDE")
plt.xlabel("BMI")
plt.ylabel("Number of Patients")
plt.show()

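### 🧠 **Quick Challenge**

Change the code above to show the distribution of `Glucose` or `Age` instead.
Bonus: Add `hue='Diabetes'` to explore how the variable differs between diabetic and non-diabetic patients. For seaborn to find that column, pass the whole table and name the column to plot, e.g. `data=df, x='BMI'`.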

In [None]:
# Quick challenge - explore seaborn here


## 📊 **5. Simple Statistical Analysis in Python**
In biomedical sciences, we often want to **summarize data, compare groups, and understand relationships between variables**. Let's explore how Python can help with basic stats.

### 🧪 **Descriptive Statistics**
You can quickly get a summary of your dataset using pandas:
```python
# Example DataFrame (note: this replaces the diabetes data previously stored in df)
data = {
    "Patient_ID": [1, 2, 3, 4, 5],
    "Age": [25, 47, 35, 52, 41],
    "Cholesterol_Level": [180, 220, 190, 250, 200]
}
df = pd.DataFrame(data)

# Summary stats
df.describe()
```
💡 `describe()` gives you the count, mean, standard deviation, min, max, and percentiles: super handy for a quick overview.
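
If you only need one of these numbers, each has its own method. Here is a minimal sketch using the same example values; the table is called `df_example` here only to keep it separate from `df`:

```python
import pandas as pd

df_example = pd.DataFrame({
    "Age": [25, 47, 35, 52, 41],
    "Cholesterol_Level": [180, 220, 190, 250, 200],
})

print(df_example["Age"].mean())                 # 40.0
print(df_example["Age"].median())               # 41.0
print(df_example["Cholesterol_Level"].max())    # 250
print(df_example["Cholesterol_Level"].std())    # sample standard deviation
```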

In [None]:
# Try it yourself!



### 📐 **Comparing Two Groups: t-test (with scipy)**
Let’s say you want to compare cholesterol levels between two treatment groups:
```python
from scipy import stats

# Simulated example
group_A = [180, 190, 195, 205, 210]
group_B = [200, 220, 215, 225, 230]

# Perform an independent t-test
t_stat, p_value = stats.ttest_ind(group_A, group_B)

print("T-statistic:", t_stat)
print("P-value:", p_value)
```
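
A p-value is usually compared against a threshold chosen in advance; 0.05 is a common convention, used here purely for illustration. The sketch below also passes `equal_var=False`, which asks scipy for Welch's t-test, a variant that does not assume both groups have the same variance:

```python
from scipy import stats

# Same simulated groups as above
group_A = [180, 190, 195, 205, 210]
group_B = [200, 220, 215, 225, 230]

# Welch's t-test: does not assume equal variances
t_stat, p_value = stats.ttest_ind(group_A, group_B, equal_var=False)

alpha = 0.05  # conventional threshold, an assumption for this example
if p_value < alpha:
    print("The difference between the group means is statistically significant.")
else:
    print("No statistically significant difference was detected.")
```

With only five values per group, this is a demonstration of the mechanics rather than a meaningful analysis.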

In [None]:
# What do you get?


### 🧠 **Correlation: Finding Relationships Between Variables**
```python
# Correlation between Age and Cholesterol
correlation = df["Age"].corr(df["Cholesterol_Level"])
print("Correlation coefficient:", correlation)
```
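
Two related options, shown as a sketch on the same example values: `corr()` on a whole table gives the correlation between every pair of columns, and `scipy.stats.pearsonr` returns a p-value alongside the coefficient. The name `df_example` is used only for this illustration:

```python
import pandas as pd
from scipy import stats

df_example = pd.DataFrame({
    "Age": [25, 47, 35, 52, 41],
    "Cholesterol_Level": [180, 220, 190, 250, 200],
})

# Correlation matrix: every column against every other column
print(df_example.corr())

# Pearson correlation coefficient together with its p-value
r, p_value = stats.pearsonr(df_example["Age"], df_example["Cholesterol_Level"])
print("Pearson r:", r)
print("P-value:", p_value)
```

A coefficient close to 1 or -1 suggests a strong linear relationship, and one close to 0 suggests little or none. Correlation on its own does not show that one variable causes the other.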

In [None]:
# Try it out


## 🎉 **Wrapping Up**
You’ve just taken your **first steps into data analysis with Python**! From loading data and summarizing it to visualizing and analyzing relationships, you're now equipped to start exploring real-world biomedical datasets.


🌟 *Remember: Programming is a muscle — the more you use it, the stronger it gets.*