## 1. Python & Data Analysis Basics

In this section, I’ll review:
- Python fundamentals (lists, dictionaries, functions)
- Loading and saving CSV files with pandas
- Exploring the structure of a DataFrame using `.head()`, `.info()`, `.shape`, and `.dtypes`

These basics are the foundation for inspecting a dataset and understanding its structure before any analysis.

In [7]:
# Example of working with a dictionary and a simple function 
expenses = {'Monday': 23.5, 'Tuesday': 17.0, 'Wednesday': 30.0}

def daily_average(expense_dict):
    return sum(expense_dict.values()) / len(expense_dict)

print(f'Average daily expense: €{daily_average(expenses):.2f}')

## Doubt: Couldn't it be done using groupby and .mean()?
# Answer: sum()/len() is native Python and fits a plain dict, while groupby().mean() is ideal when working with a pandas DataFrame.

Average daily expense: €23.50
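
To follow up on the doubt above, here is a minimal sketch of the pandas version, assuming the same expenses are loaded into a Series. The names `expense_series` and `expense_df`, and the `week` grouping column, are made up for illustration; `groupby` only adds something when there are groups to average within.

```python
import pandas as pd

expenses = {'Monday': 23.5, 'Tuesday': 17.0, 'Wednesday': 30.0}

# A Series built from the dict: the keys become the index
expense_series = pd.Series(expenses)
series_mean = expense_series.mean()

# Hypothetical grouping key: 'week' is not part of the original data
expense_df = pd.DataFrame({
    'week': ['W1', 'W1', 'W2'],
    'amount': [23.5, 17.0, 30.0],
})
weekly_mean = expense_df.groupby('week')['amount'].mean()

print(f'Average daily expense: €{series_mean:.2f}')
print(weekly_mean)
```

For a single flat dictionary, `sum()/len()` and `Series.mean()` give the same number; the pandas route pays off once the data already lives in a DataFrame.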


In [11]:
# Load CSV with pandas
import pandas as pd

# Load dataset from a public URL
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv')

In [21]:
# Preview and inspect the dataset
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [27]:
df.shape

(244, 7)

In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object 
 3   smoker      244 non-null    object 
 4   day         244 non-null    object 
 5   time        244 non-null    object 
 6   size        244 non-null    int64  
dtypes: float64(2), int64(1), object(4)
memory usage: 13.5+ KB


In [33]:
df.dtypes

total_bill    float64
tip           float64
sex            object
smoker         object
day            object
time           object
size            int64
dtype: object

### Summary: What I Learned

- `.info()` gives a quick overview of the columns, non-null counts, and data types.
- `.dtypes` helps verify whether variables like `total_bill` are numeric.
- Data types matter: for example, a date column loaded as text must be converted with `pd.to_datetime()` before date-based filters behave correctly.
- The `tips` dataset has 244 rows and 7 columns, and looks like restaurant billing data: each row records a bill, the tip, and details about the party.
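
The point about dates can be sketched as follows. The `tips` dataset has no date column, so the `orders` DataFrame and its `order_date` values below are hypothetical and only illustrate the conversion step.

```python
import pandas as pd

# Hypothetical data: made up to illustrate date conversion
orders = pd.DataFrame({
    'order_date': ['2024-01-05', '2024-02-10', '2024-03-15'],
    'total_bill': [16.99, 10.34, 21.01],
})
print(orders.dtypes)  # order_date starts out as text, not as a datetime

# Convert the strings to real datetimes
orders['order_date'] = pd.to_datetime(orders['order_date'])

# Date-aware filters now work, by comparison or by calendar part
february_onward = orders[orders['order_date'] >= '2024-02-01']
february_only = orders[orders['order_date'].dt.month == 2]
print(february_onward)
```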


## 2. Data Cleaning & Transformation

In this section, I’ll explore key pandas methods to clean and transform data:
- Identify and handle missing values with `.dropna()` and `.fillna()`
- Filter rows based on conditions
- Create new columns with logic or calculations
- Use `.apply()` to transform data
- Remove duplicates
- Sort data and reset index

This step is crucial to prepare raw datasets for meaningful analysis.
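
As a quick preview of the last three items in the list, here is a rough sketch on a small made-up DataFrame. The `sample` data, the `bill_size` column, and the threshold of 30 are assumptions chosen for illustration, not part of the `tips` dataset.

```python
import pandas as pd

# Small made-up sample with one duplicated row
sample = pd.DataFrame({
    'total_bill': [16.99, 10.34, 10.34, 35.26],
    'tip': [1.01, 1.66, 1.66, 5.00],
})

# Remove exact duplicate rows
deduped = sample.drop_duplicates()

# Use .apply() to label each bill; the threshold of 30 is arbitrary
deduped = deduped.assign(
    bill_size=deduped['total_bill'].apply(
        lambda bill: 'high' if bill > 30 else 'regular'
    )
)

# Sort by bill (largest first) and rebuild a clean 0..n-1 index
result = deduped.sort_values('total_bill', ascending=False).reset_index(drop=True)
print(result)
```

Passing `drop=True` to `reset_index()` discards the old index instead of keeping it as an extra column.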

In [35]:
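# Handling Missing Values

# Simulate missing data: blank out 'tip' in rows 0-2 (.loc slices include the end label)
df.loc[0:2, 'tip'] = None

# Count nulls per column
df.isnull().sum()

# Fill missing values with the mean of the remaining tips
df['tip'] = df['tip'].fillna(df['tip'].mean())

# Alternatively, drop rows with any nulls (df = df.dropna())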
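
To make the trade-off between the two options concrete, here is a small sketch on made-up values (the `tips` frame below is hypothetical and separate from `df`). Filling keeps every row but pulls the missing entries toward the average; dropping keeps only observed values but shrinks the dataset.

```python
import numpy as np
import pandas as pd

# Made-up tips with two missing values
tips = pd.DataFrame({'tip': [1.0, np.nan, 3.0, np.nan, 5.0]})

missing_count = tips['tip'].isnull().sum()

# Option 1: fill with the mean of the observed values (keeps all rows)
filled = tips['tip'].fillna(tips['tip'].mean())

# Option 2: drop incomplete rows (keeps only observed values)
dropped = tips.dropna()

print(missing_count, len(filled), len(dropped))
```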

In [39]:
# Filter for high bills over €30
high_bills = df[df['total_bill'] > 30]
high_bills.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
11,35.26,5.0,Female,No,Sun,Dinner,4
23,39.42,7.58,Male,No,Sat,Dinner,4
39,31.27,5.0,Male,No,Sat,Dinner,3
44,30.4,5.6,Male,No,Sun,Dinner,4
47,32.4,6.0,Male,No,Sun,Dinner,4
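
Filters can also combine several conditions. The sketch below uses a small made-up `sample` DataFrame rather than the full dataset; each condition needs its own parentheses, and `&` / `|` replace Python's `and` / `or`.

```python
import pandas as pd

# Made-up rows for illustration
sample = pd.DataFrame({
    'total_bill': [35.26, 39.42, 16.99, 31.27],
    'day': ['Sun', 'Sat', 'Sun', 'Sat'],
})

# Boolean masks combined with & ("and")
high_sunday = sample[(sample['total_bill'] > 30) & (sample['day'] == 'Sun')]

# .query() expresses the same filter as a string
high_sunday_q = sample.query("total_bill > 30 and day == 'Sun'")

print(high_sunday)
```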


In [49]:
# Calculate tip percentage

# Tip as a share of the total bill, stored as a plain number (15.0 means 15%)
df['tip_percentage'] = df['tip'] / df['total_bill'] * 100
df[['total_bill', 'tip', 'tip_percentage']].head()