In [None]:
import pandas as pd
df = pd.read_csv("survey_data.csv")
df = df.set_index("ID")
df.head()

**Goal:** Describe the distribution of hours spent learning to code

In [None]:
hours = df["HoursSpentLearningToCode"]

In [None]:
hours.hist()

In [None]:
hours.mean()

In [None]:
print(hours.mean())

In [None]:
hours.median()

#### `.std()` tells you how much values typically deviate from the mean.


In [None]:
hours.std()
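To make "typically deviate from the mean" concrete, here is a minimal sketch on a made-up Series (the values are illustrative and not taken from the survey). It rebuilds the sample standard deviation step by step and compares it with `.std()`, which divides by `n - 1` by default.

```python
import pandas as pd

# Made-up values for illustration only (not taken from the survey data)
sample_hours = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# Step 1: how far is each value from the mean?
deviations = sample_hours - sample_hours.mean()

# Step 2: sum the squared deviations and divide by n - 1 (sample variance)
sample_variance = (deviations ** 2).sum() / (len(sample_hours) - 1)

# Step 3: the square root brings the result back to the original units (hours)
manual_std = sample_variance ** 0.5

print(manual_std)
print(sample_hours.std())  # pandas uses ddof=1 by default, so this should match
```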

In [None]:
df["HoursSpentLearningToCode"].mean()

In [None]:
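# get quartiles! (the Series method is .quantile, not .quartile)
hours.quantile(0.25)  # 25th percentile (first quartile)
hours.quantile(0.5)   # 50th percentile (median)
hours.quantile(0.75)  # 75th percentile; only this last value is displayed by the cell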

#### `.quantile(0.25)` gives the value below which 25% of the data lies, e.g. `hours_25th_percentile = hours.quantile(0.25)`

In [None]:
hours_25th_percentile = hours.quantile(0.25)
hours_50th_percentile = hours.quantile(0.5)
hours_75th_percentile = hours.quantile(0.75)

print(hours_25th_percentile)
print(hours_50th_percentile)
print(hours_75th_percentile)
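A quick way to check the "value below which 25% of the data lies" reading is to try it on data where the answer is obvious. The sketch below uses a made-up Series of the integers 0 to 100, which is an assumption for illustration and not survey data.

```python
import pandas as pd

# Made-up values 0, 1, ..., 100 so the quartiles are easy to check by eye
values = pd.Series(range(101))

q25 = values.quantile(0.25)
q50 = values.quantile(0.5)
q75 = values.quantile(0.75)
print(q25, q50, q75)

# Share of values strictly below the 25th percentile: roughly a quarter
share_below_q25 = (values < q25).mean()
print(share_below_q25)

# The 50th percentile is the median
print(q50 == values.median())
```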

In [None]:
percentiles = hours.quantile([0.25, 0.5, 0.75])
print(percentiles)

In [None]:
type(percentiles)

In [None]:
percentiles[0.25]

In [None]:
hours.describe()

In [None]:
hours.skew() # strong positive skew
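Positive skew means a long tail of large values, which pulls the mean above the median. A minimal sketch with made-up numbers (not survey data) shows the pattern:

```python
import pandas as pd

# Made-up values: most are small, one is very large (a long right tail)
right_skewed = pd.Series([1, 1, 2, 2, 3, 3, 4, 50])

print(right_skewed.mean())    # pulled upward by the single large value
print(right_skewed.median())  # barely affected by it
print(right_skewed.skew())    # positive for a right-skewed distribution
```

When the skew is strong, the median is usually the safer summary of a "typical" value.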

In [None]:
df.describe()

**Goals:** 
- [ ] Describe the distribution of gender
- [ ] Identify the top 5 languages spoken at home

In [None]:
df.dtypes

In [None]:
counts_of_gender = df["Gender"].value_counts()

In [None]:
counts_of_gender

In [None]:
type(counts_of_gender)

In [None]:
counts_of_gender["male"]
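Raw counts describe the distribution, but shares are often easier to compare. As a sketch on made-up responses (the values are assumptions, not survey data), `value_counts(normalize=True)` returns proportions instead of counts:

```python
import pandas as pd

# Made-up responses for illustration only
gender = pd.Series(["male", "female", "male", "male", "female", "male"])

counts = gender.value_counts()                     # absolute counts
proportions = gender.value_counts(normalize=True)  # shares that sum to 1

print(counts)
print(proportions)
```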

In [None]:
counts_of_gender.plot(kind="bar")

In [None]:
counts_of_gender.plot(
    kind="bar",
    color="skyblue",
    title="Counts of Gender",
    xlabel="Gender",
    ylabel="Count",
    figsize=(8, 6)
)

In [None]:
counts_of_gender.plot(kind="barh")  # "column" is not a valid kind; use "bar" (vertical) or "barh" (horizontal)

In [None]:
# top 5 languages spoken at home
counts_of_languages = df["LanguageAtHome"].value_counts()

In [None]:
type(counts_of_languages)

In [None]:
counts_of_languages.head()

In [None]:
len(counts_of_languages)

In [None]:
counts_of_languages.plot(kind="bar")

In [None]:
top_5_languages = counts_of_languages.head(5)  # value_counts() sorts descending, so these are the 5 most common

In [None]:
type(top_5_languages)

In [None]:
top_5_languages.plot(kind="bar")
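The "top 5" step works because `value_counts()` returns its result sorted from most to least frequent, so the first five rows are the five most common values. A small sketch with made-up responses (the language names and counts are assumptions for illustration):

```python
import pandas as pd

# Made-up responses: six languages with 6, 5, 4, 3, 2 and 1 speakers
languages = pd.Series(
    ["English"] * 6 + ["Spanish"] * 5 + ["Hindi"] * 4
    + ["French"] * 3 + ["German"] * 2 + ["Urdu"] * 1
)

counts = languages.value_counts()  # sorted from most to least frequent
top_5 = counts.head(5)             # first five rows = five most common

print(top_5)
print(len(counts), len(top_5))
```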

**Goal:** Calculate correlation between income and money spent learning to code

In [None]:
df.dtypes

In [None]:
df.plot(kind="scatter", x="Income", y="MoneySpentLearningToCode")

In [None]:
df.plot(kind="scatter", x="Age", y="MonthsSpentProgramming")

##### Key phrasing (this is the best mental model)
- +ve correlation → same-direction movement
- −ve correlation → opposite-direction movement
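The two cases above can be sketched with made-up numbers (not survey data). `Series.corr` computes the Pearson correlation by default:

```python
import pandas as pd

# Made-up values for illustration only
x = pd.Series([1, 2, 3, 4, 5])
rises_with_x = pd.Series([2, 4, 5, 8, 10])  # tends to go up when x goes up
falls_with_x = pd.Series([10, 8, 5, 4, 2])  # tends to go down when x goes up

print(x.corr(rises_with_x))  # positive: same-direction movement
print(x.corr(falls_with_x))  # negative: opposite-direction movement
```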


In [None]:
df.corr(numeric_only=True)  # pandas >= 2.0 does not drop non-numeric columns automatically

In [None]:
columns = ["Age", "NumberOfChildren", "MoneySpentLearningToCode", "MonthsSpentProgramming", "Income"]

In [None]:
type(columns)

In [None]:
selected_columns = df[columns]

In [None]:
type(selected_columns)

In [None]:
selected_columns.head()

In [None]:
selected_columns.corr()
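The correlation matrix holds one value per pair of columns, so the figure for a single pair can be read from one cell. A minimal sketch, assuming made-up rows (only the column names come from this notebook):

```python
import pandas as pd

# Made-up rows for illustration; only the column names come from the notebook
toy = pd.DataFrame({
    "Income": [20000, 35000, 50000, 65000, 80000],
    "MoneySpentLearningToCode": [100, 0, 400, 300, 900],
})

corr_matrix = toy.corr()

# One cell of the matrix: row "Income", column "MoneySpentLearningToCode"
from_matrix = corr_matrix.loc["Income", "MoneySpentLearningToCode"]

# The same number computed directly from the two columns
direct = toy["Income"].corr(toy["MoneySpentLearningToCode"])

print(from_matrix, direct)
```

The diagonal of the matrix is always 1, because every column correlates perfectly with itself.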

**Goal:** Determine whether hours spent learning to code varies by number of children

In [None]:
df

In [None]:
# start with entire data frame
grouped_by_children = df.groupby("NumberOfChildren")

In [None]:
type(grouped_by_children)

In [None]:
grouped_by_children["HoursSpentLearningToCode"].count()

In [None]:
grouped_by_children["HoursSpentLearningToCode"].mean()
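To see what the grouped `count()` and `mean()` calls return, here is a sketch on a tiny made-up DataFrame (the rows are assumptions; only the column names come from this notebook). Each distinct value of `NumberOfChildren` becomes one row of the result.

```python
import pandas as pd

# Made-up rows for illustration; only the column names come from the notebook
toy = pd.DataFrame({
    "NumberOfChildren": [0, 0, 0, 1, 1, 2],
    "HoursSpentLearningToCode": [10, 20, 30, 8, 12, 5],
})

grouped = toy.groupby("NumberOfChildren")["HoursSpentLearningToCode"]

print(grouped.count())  # how many respondents fall into each group
print(grouped.mean())   # average hours within each group
```

Checking the counts first is useful because a mean based on only a handful of rows says little about the group.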

**Goal:** Determine whether hours spent learning to code varies by **number of children** and **is software dev**

In [None]:
# segment by two features -> use pivot table
df.pivot_table(
    index="NumberOfChildren", # corresponds to row
    columns="IsSoftwareDev",
    values="HoursSpentLearningToCode"
)

In [None]:
df.pivot_table(
    index="NumberOfChildren", # corresponds to row
    columns="IsSoftwareDev",
    values="HoursSpentLearningToCode",
    aggfunc="mean"
)

In [None]:
df.pivot_table(
    index="NumberOfChildren", # corresponds to row
    columns="IsSoftwareDev",
    values="HoursSpentLearningToCode",
    aggfunc="sum"
)

In [None]:
df.pivot_table(
    index="NumberOfChildren", # corresponds to row
    columns="IsSoftwareDev",
    values="HoursSpentLearningToCode",
    aggfunc="std"
)

In [None]:
df.pivot_table(
    index="NumberOfChildren", # corresponds to row
    columns="IsSoftwareDev",
    values="HoursSpentLearningToCode",
    aggfunc=["min", "median", "max"]
)
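One way to think about these pivot tables: group by both features, aggregate, then spread the second feature across the columns. The sketch below checks that on made-up rows (only the column names come from this notebook); `aggfunc="mean"` is also what `pivot_table` uses when no `aggfunc` is given.

```python
import pandas as pd

# Made-up rows for illustration; only the column names come from the notebook
toy = pd.DataFrame({
    "NumberOfChildren": [0, 0, 0, 0, 1, 1, 1, 1],
    "IsSoftwareDev": [0, 1, 0, 1, 0, 1, 0, 1],
    "HoursSpentLearningToCode": [10, 20, 30, 40, 5, 15, 25, 35],
})

pivot = toy.pivot_table(
    index="NumberOfChildren",   # one row per value
    columns="IsSoftwareDev",    # one column per value
    values="HoursSpentLearningToCode",
    aggfunc="mean",
)

# Same numbers via groupby on both keys, then moving IsSoftwareDev to the columns
via_groupby = (
    toy.groupby(["NumberOfChildren", "IsSoftwareDev"])["HoursSpentLearningToCode"]
    .mean()
    .unstack()
)

print(pivot)
print(via_groupby)
```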

# Choosing the Right Graph - Quick Reference Guide

## By Data Type:

### 1. SINGLE NUMERICAL COLUMN
- **Histogram** - Shows distribution of one continuous variable
  - "What's the distribution of ages?" → Use histogram
  - `df['Age'].hist()`
  
- **Box Plot** - Shows range, median, quartiles
  - `df['Age'].plot(kind='box')`

### 2. SINGLE CATEGORICAL COLUMN
- **Bar Graph** - Counts of each category
  - "How many males vs females?" → Use bar graph
  - `df['Gender'].value_counts().plot(kind='bar')`

- **Pie Chart** - Proportions of categories
  - `df['Gender'].value_counts().plot(kind='pie')`

### 3. TWO NUMERICAL COLUMNS
- **Scatter Plot** - Shows correlation/relationship
  - "Is there a relationship between Age and Hours Studying?" → Use scatter plot
  - Only for NUMERICAL vs NUMERICAL
  - `df.plot(kind='scatter', x='Age', y='HoursSpentLearningToCode')`

- **Line Plot** - Shows trend over time
  - `df.plot(x='Day', y='Temperature', kind='line')`

### 4. ONE NUMERICAL + ONE CATEGORICAL
- **Box Plot by Category** - Compare distributions
  - "How does Age differ by Gender?" → Use grouped box plot
  - `df.boxplot(column='Age', by='Gender')`

- **Bar Plot with Groups** - Compare means or counts
  - `df.groupby('Gender')['Age'].mean().plot(kind='bar')`

### 5. MULTIPLE VARIABLES
- **Heatmap/Correlation Matrix** - Shows all correlations
  - Numerical data only
  - `df.corr(numeric_only=True).style.background_gradient(cmap='coolwarm')` (pandas has no `kind='imshow'` plot)

## Easy Mnemonic to Remember:

| Question | Data Type | Answer |
|----------|-----------|--------|
| **One column, how is it spread out?** | Numerical | **Histogram** (distribution) |
| **One column, what are the categories?** | Categorical | **Bar Graph** (counts) |
| **Two columns, any relationship?** | Both Numerical | **Scatter Plot** (correlation) |
| **Comparing a value across groups?** | 1 Numerical + 1 Categorical | **Box Plot by Group** |
| **Seeing all relationships together?** | Multiple Numerical | **Correlation Heatmap** |
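The table above can be sketched as a small lookup function. `suggest_chart` is a hypothetical helper written for this guide, not a pandas function, and the sample rows are made up; it only uses `pandas.api.types.is_numeric_dtype` to tell numerical columns from categorical ones.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def suggest_chart(df, columns):
    """Hypothetical helper: map one or more columns to a chart type from the table."""
    numeric = [is_numeric_dtype(df[col]) for col in columns]
    if len(columns) == 1:
        return "histogram" if numeric[0] else "bar graph"
    if len(columns) == 2:
        if all(numeric):
            return "scatter plot"
        if any(numeric):
            return "box plot by group"
        return "no suggestion in the table"
    # Three or more columns: the table only covers the all-numerical case
    return "correlation heatmap" if all(numeric) else "no suggestion in the table"


# Made-up rows for illustration; only the column names come from the notebook
toy = pd.DataFrame({
    "Age": [21, 35, 28],
    "Income": [30000, 52000, 41000],
    "HoursSpentLearningToCode": [10, 5, 20],
    "Gender": ["male", "female", "female"],
})

print(suggest_chart(toy, ["Age"]))
print(suggest_chart(toy, ["Gender"]))
print(suggest_chart(toy, ["Age", "Income"]))
print(suggest_chart(toy, ["Age", "Gender"]))
print(suggest_chart(toy, ["Age", "Income", "HoursSpentLearningToCode"]))
```

One caveat with this sketch: a category stored as numbers (for example a 0/1 flag) is treated as numerical, so the dtype check is only a starting point.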

## Key Rules:
- 🔢 **Numerical = Histogram** (one variable)
- 📊 **Categorical = Bar Chart** (one variable)
- 📈 **Correlation/Relationship = Scatter Plot** (2+ numerical variables)
- 🎯 **Comparing Groups = Box Plot or Grouped Bar** (numerical vs categorical)