---
# Overview  
This project demonstrates **basic to intermediate** data manipulation techniques using the **Pandas** library on the **Cartwheel dataset**, which records each participant's age, gender, height, wingspan, cartwheel distance, and score.  

### Objectives  
- Load and inspect the dataset.  
- Clean and preprocess data (handle missing values, outliers).   
---

In [5]:
# Import the Pandas library for data manipulation and analysis
import pandas as pd

# Define the file path to the CSV dataset (the r prefix makes this a raw string, so the backslashes in the Windows path are not treated as escape characters)
file = r"C:\Users\Felipe\Documents\Filipe docs\ai projects\DataScienceShowcase\Datasets\cartwheel.csv"

# Load the CSV file into a Pandas DataFrame
df = pd.read_csv(file)

# Check the type of the loaded object (should output: pandas.core.frame.DataFrame)
type(df)

pandas.core.frame.DataFrame

---
### Previewing the data
---

In [None]:
# Display the first few rows of the DataFrame to understand its structure and contents
df.head()

,ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score
0,1,56,F,1,Y,1,62.0,61.0,79,Y,1,7
1,2,26,F,1,Y,1,62.0,60.0,70,Y,1,8
2,3,33,F,1,Y,1,66.0,64.0,85,Y,1,7
3,4,39,F,1,N,0,64.0,63.0,87,Y,1,10
4,5,27,M,2,N,0,73.0,75.0,72,N,0,4
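Beyond `head()`, a few other calls give a quick overview of a freshly loaded DataFrame. The sketch below is illustrative only: it rebuilds a three-row sample from the rows shown above using `io.StringIO` (an assumption made so the example runs without the CSV file), and the names `csv_text` and `sample` are hypothetical.

```python
import io
import pandas as pd

# First three rows of the df.head() output, inlined as CSV text
# so the example does not depend on the file path used in this notebook
csv_text = """ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score
1,56,F,1,Y,1,62.0,61.0,79,Y,1,7
2,26,F,1,Y,1,62.0,60.0,70,Y,1,8
3,33,F,1,Y,1,66.0,64.0,85,Y,1,7
"""
sample = pd.read_csv(io.StringIO(csv_text))

# (number of rows, number of columns)
print(sample.shape)

# Number of missing values per column
print(sample.isna().sum())

# Column names, non-null counts, and dtypes in one summary
sample.info()
```

On the full dataset the same calls would be applied to `df` instead of `sample`.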


In [6]:
# Display the columns of the DataFrame to see what data is available
df.columns

Index(['ID', 'Age', 'Gender', 'GenderGroup', 'Glasses', 'GlassesGroup',
       'Height', 'Wingspan', 'CWDistance', 'Complete', 'CompleteGroup',
       'Score'],
      dtype='object')

---
### DataFrame Column Types

In a DataFrame:
- Each column has one specific type (like numbers, text, or dates)
- Different columns can have different types

Real data mixes different kinds of information, but the values within a single column should be consistent.

### How to Check Types

df.dtypes
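
As a small illustration of one type per column, the sketch below builds a toy DataFrame (hypothetical data, not read from `cartwheel.csv`) and then uses `select_dtypes` to keep only the numeric columns. The names `toy` and `numeric` are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data with three different kinds of values
toy = pd.DataFrame({
    "Age": [56, 26, 33],           # whole numbers
    "Height": [62.0, 62.5, 66.0],  # decimals
    "Gender": ["F", "F", "M"],     # text
})

# One dtype per column
print(toy.dtypes)

# Keep only the columns with a numeric dtype
numeric = toy.select_dtypes(include="number")
print(list(numeric.columns))
```

Filtering by dtype like this can be handy before computing statistics that only make sense for numbers.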

---

In [7]:
# Display the data types of each column in the DataFrame to understand the nature of the data
df.dtypes

ID                 int64
Age                int64
Gender            object
GenderGroup        int64
Glasses           object
GlassesGroup       int64
Height           float64
Wingspan         float64
CWDistance         int64
Complete          object
CompleteGroup      int64
Score              int64
dtype: object

___

## Pandas DataFrame Indexing Basics

### Position vs Label Indexing
- Rows/columns can be referenced by:
  - Position (0-based, like Python lists)
  - Labels (custom names for rows/columns)

### Best Practices
✔ Prefer column names (e.g., `df['Age']`) over positions  
✖ Avoid numeric positions (e.g., `df.iloc[:, 3]`) - they break if columns move

### Default Indexes
- Rows: Default to 0,1,2... (often kept as-is)  
- Columns: Usually have meaningful names (rarely use numeric positions)

### Main Selection Methods
1. `.loc[]` - Select by **label** (row/column names)
2. `.iloc[]` - Select by **position** (0-based numbers)
3. `.ix[]` - **Removed** in pandas 1.0 (previously mixed both methods)

> Tip: Stick with `.loc` and `.iloc` for clear, maintainable code.
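
The difference between labels and positions is easiest to see when the row index is not the default 0, 1, 2. The sketch below uses a hypothetical toy DataFrame with string row labels; `toy`, `by_label`, and `by_position` are names introduced only for this example.

```python
import pandas as pd

# Hypothetical toy frame with custom row labels instead of 0, 1, 2
toy = pd.DataFrame(
    {"Age": [56, 26, 33], "Score": [7, 8, 10]},
    index=["a", "b", "c"],
)

# .loc uses names: the row labelled "b" and the column named "Score"
by_label = toy.loc["b", "Score"]

# .iloc uses 0-based positions: second row, second column
by_position = toy.iloc[1, 1]

print(by_label, by_position)  # 8 8
```

Both calls reach the same cell here, but only the `.loc` version keeps working if the columns are reordered.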

___

In [8]:
# Access the 'CWDistance' column from the DataFrame
df.loc[:,"CWDistance"]

0      79
1      70
2      85
3      87
4      72
5      81
6     107
7      98
8     106
9      65
10     96
11     79
12     92
13     66
14     72
15    115
16     90
17     74
18     64
19     85
20     66
21    101
22     82
23     63
24     67
Name: CWDistance, dtype: int64

In [9]:
# Access the 'CWDistance' column using bracket notation
df["CWDistance"]

0      79
1      70
2      85
3      87
4      72
5      81
6     107
7      98
8     106
9      65
10     96
11     79
12     92
13     66
14     72
15    115
16     90
17     74
18     64
19     85
20     66
21    101
22     82
23     63
24     67
Name: CWDistance, dtype: int64

In [10]:
# Access the 'CWDistance' column using dot notation
df.CWDistance

0      79
1      70
2      85
3      87
4      72
5      81
6     107
7      98
8     106
9      65
10     96
11     79
12     92
13     66
14     72
15    115
16     90
17     74
18     64
19     85
20     66
21    101
22     82
23     63
24     67
Name: CWDistance, dtype: int64
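
The three access styles above return the same Series, but dot notation has limits worth knowing. The sketch below uses a hypothetical toy DataFrame whose column names `CW Distance` and `count` are invented to show two cases where dot notation does not work as expected.

```python
import pandas as pd

# Hypothetical columns: one with a space, one that shares a name with a DataFrame method
toy = pd.DataFrame({
    "CWDistance": [79, 70],
    "CW Distance": [1, 2],
    "count": [5, 6],
})

# Bracket, .loc, and dot notation agree for a simple column name
same = (
    toy["CWDistance"].equals(toy.CWDistance)
    and toy["CWDistance"].equals(toy.loc[:, "CWDistance"])
)

# A name containing a space cannot be written as an attribute, so brackets are required
spaced = toy["CW Distance"]

# toy.count is the DataFrame.count method, not the column called "count"
is_method = callable(toy.count)
count_column = toy["count"]

print(same, is_method, list(count_column))
```

Bracket notation is therefore the safer default, which matches the best practice of preferring `df['Age']`.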

In [11]:
# Access multiple columns ('CWDistance', 'Height', 'Wingspan') from the DataFrame
df.loc[:,["CWDistance", "Height", "Wingspan"]]

,CWDistance,Height,Wingspan
0,79,62.0,61.0
1,70,62.0,60.0
2,85,66.0,64.0
3,87,64.0,63.0
4,72,73.0,75.0
5,81,75.0,71.0
6,107,75.0,76.0
7,98,65.0,62.0
8,106,74.0,73.0
9,65,63.0,60.0


In [12]:
# Access the first 10 rows (labels 0 through 9; .loc includes the end label) of the specified columns ('CWDistance', 'Height', 'Wingspan')
df.loc[:9, ["CWDistance", "Height", "Wingspan"]]

,CWDistance,Height,Wingspan
0,79,62.0,61.0
1,70,62.0,60.0
2,85,66.0,64.0
3,87,64.0,63.0
4,72,73.0,75.0
5,81,75.0,71.0
6,107,75.0,76.0
7,98,65.0,62.0
8,106,74.0,73.0
9,65,63.0,60.0


In [13]:
# Access the rows labelled 10 through 15 (.loc includes both ends of the slice)
df.loc[10:15]

,ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score
10,11,30,M,2,Y,1,69.5,66.0,96,Y,1,6
11,12,28,F,1,Y,1,62.75,58.0,79,Y,1,10
12,13,25,F,1,Y,1,65.0,64.5,92,Y,1,6
13,14,23,F,1,N,0,61.5,57.5,66,Y,1,4
14,15,31,M,2,Y,1,73.0,74.0,72,Y,1,9
15,16,26,M,2,Y,1,71.0,72.0,115,Y,1,6


In [14]:
# Access the first 4 rows of the DataFrame
df.iloc[:4]

,ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score
0,1,56,F,1,Y,1,62.0,61.0,79,Y,1,7
1,2,26,F,1,Y,1,62.0,60.0,70,Y,1,8
2,3,33,F,1,Y,1,66.0,64.0,85,Y,1,7
3,4,39,F,1,N,0,64.0,63.0,87,Y,1,10
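
The `.loc` and `.iloc` slices above treat the end of the range differently: `.loc` includes the end label, while `.iloc` excludes the end position, as Python lists do. The sketch below shows this on a toy single-column DataFrame built from the first six `CWDistance` values; `toy`, `label_slice`, and `position_slice` are names introduced for the example.

```python
import pandas as pd

# First six CWDistance values, with the default 0..5 row index
toy = pd.DataFrame({"CWDistance": [79, 70, 85, 87, 72, 81]})

# Label-based slice: rows labelled 1, 2, and 3 (end label included)
label_slice = toy.loc[1:3]

# Position-based slice: rows at positions 1 and 2 (end position excluded)
position_slice = toy.iloc[1:3]

print(len(label_slice), len(position_slice))  # 3 2
```

With the default integer index the two look alike, so the off-by-one difference is easy to miss.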


In [15]:
# Access the rows at positions 1 through 4 and the columns at positions 2 and 3 (.iloc excludes the end of each slice)
df.iloc[1:5, 2:4]

,Gender,GenderGroup
1,F,1
2,F,1
3,F,1
4,M,2


In [16]:
# Same selection as above: slice rows by position with .iloc, then pick the columns by name
df.iloc[1:5, :][["Gender", "GenderGroup"]]

,Gender,GenderGroup
1,F,1
2,F,1
3,F,1
4,M,2


In [17]:
# List the unique values in the 'Gender' column
df["Gender"].unique()

array(['F', 'M'], dtype=object)

In [18]:
# List the unique values in the 'GenderGroup' column
df["GenderGroup"].unique()

array([1, 2])

In [20]:
# Display the 'Gender' and 'GenderGroup' columns side by side
df[["Gender", "GenderGroup"]]

,Gender,GenderGroup
0,F,1
1,F,1
2,F,1
3,F,1
4,M,2
5,M,2
6,M,2
7,F,1
8,M,2
9,F,1


In [21]:
# Create a crosstab to summarize the relationship between gender and gender group
pd.crosstab(df["Gender"], df["GenderGroup"])

Gender,GenderGroup=1,GenderGroup=2
F,12,0
M,0,13


In [23]:
# Count the number of occurrences for each unique combination of Gender and GenderGroup
# Returns a pandas Series with a MultiIndex containing the counts
df.groupby(['Gender','GenderGroup']).size()

Gender  GenderGroup
F       1              12
M       2              13
dtype: int64
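
The crosstab and the groupby count above carry the same information in two layouts. The sketch below, on a hypothetical five-row toy DataFrame, reshapes the groupby result into the crosstab layout with `unstack` and adds one possible check that each gender maps to exactly one group code. The names `toy`, `table`, `counts`, `wide`, and `one_to_one` are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data following the same F -> 1, M -> 2 coding
toy = pd.DataFrame({
    "Gender": ["F", "F", "M", "M", "M"],
    "GenderGroup": [1, 1, 2, 2, 2],
})

# Wide layout: one row per gender, one column per group code
table = pd.crosstab(toy["Gender"], toy["GenderGroup"])

# Long layout: one entry per observed (Gender, GenderGroup) pair
counts = toy.groupby(["Gender", "GenderGroup"]).size()

# Pivot the long layout into the wide one, filling unobserved pairs with 0
wide = counts.unstack(fill_value=0)

# True when every gender is paired with exactly one group code
one_to_one = bool((toy.groupby("Gender")["GenderGroup"].nunique() == 1).all())

print(table)
print(wide)
print(one_to_one)
```

A check like `one_to_one` can confirm programmatically what the zero cells in the crosstab suggest: `GenderGroup` is just a numeric encoding of `Gender`.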