# Data Science Bootcamp – ML/AI Assignment

## Section 1: Python + Data Familiarity

### 1. What is the difference between a list, a dictionary, and a NumPy array? Give one example of each.
- **List**: An ordered collection of items. Mutable. Can contain different data types.
- **Dictionary**: A collection of key-value pairs. Useful for lookups.
- **NumPy Array**: Used for numerical computations. Supports vectorized operations and broadcasting.

### Example:
```python
import numpy as np

my_list = [1, 2, 3]
my_dict = {'name': 'Alice', 'age': 25}
my_array = np.array([1, 2, 3])

print('List:', my_list)
print('Dictionary:', my_dict)
print('NumPy Array:', my_array)
```
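The vectorization and broadcasting mentioned above are easiest to see by applying the same operator to a list and to an array. The following is a minimal sketch; the variable names `repeated`, `doubled`, `shifted`, and `grid` are illustrative only.

```python
import numpy as np

my_list = [1, 2, 3]
my_array = np.array([1, 2, 3])

# On a list, * repeats the sequence
repeated = my_list * 2

# On an array, * is applied element by element (vectorized)
doubled = my_array * 2

# Broadcasting: the scalar 10 is stretched to match the array's shape
shifted = my_array + 10

# Broadcasting a (3, 1) column against a (3,) row yields a (3, 3) grid
grid = my_array.reshape(3, 1) * my_array

print('List * 2:', repeated)
print('Array * 2:', doubled)
print('Array + 10:', shifted)
print('Grid shape:', grid.shape)
```

The list result has six elements because the sequence is repeated, while the array result keeps three elements with each value doubled.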

```python
def square_even_numbers(numbers):
    # Keep only the even values, then square each one
    return [x**2 for x in numbers if x % 2 == 0]

numbers = [10, 15, 20, 25, 30]
square_even_numbers(numbers)  # [100, 400, 900]
```

### 3. What does the following code output, and why?
```python
x = [1, 2, 3]
y = x
y.append(4)
print(x)
```
**Output:** `[1, 2, 3, 4]`

**Why?** `y = x` does not copy the list; it binds a second name to the same list object in memory. Appending through `y` therefore changes the list that `x` also refers to.
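When an independent list is wanted, an explicit copy is needed. The sketch below contrasts an alias with a shallow copy and a deep copy; the names `alias`, `shallow`, `nested`, and so on are made up for illustration.

```python
import copy

x = [1, 2, 3]
alias = x           # second name for the same list object
shallow = x.copy()  # new list holding the same elements

alias.append(4)
print(x)        # [1, 2, 3, 4]  (changed through the alias)
print(shallow)  # [1, 2, 3]     (the copy is unaffected)
print(alias is x, shallow is x)  # True False

# With nested lists, a shallow copy still shares the inner lists
nested = [[1, 2], [3]]
shallow_nested = nested.copy()
deep_nested = copy.deepcopy(nested)

nested[0].append(99)
print(shallow_nested[0])  # [1, 2, 99]  (inner list is shared)
print(deep_nested[0])     # [1, 2]      (fully independent)
```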

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [22, 25, 30, 28, 40],
    'Salary': [50000, 60000, 80000, 75000, 90000]
})

print(df.shape)         # Rows and columns
print(df.describe())    # Summary statistics
```

### 5. Load and explore monthly passenger data
```python
# Let's recreate the data manually since it's small
data = {
    'Month': ['JAN','FEB','MAR','APR','MAY','JUN','JUL','AUG','SEP','OCT','NOV','DEC'],
    '1958': [340,318,362,348,363,435,491,505,404,359,310,337],
    '1959': [360,342,406,396,420,472,548,559,463,407,362,405],
    '1960': [417,391,419,461,472,535,622,606,508,461,390,432]
}
df = pd.DataFrame(data)
df.head()
```
**What I noticed:**
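- Passenger counts rise from year to year: every month is higher in 1959 than in 1958, and higher again in 1960.
- Seasonality is visible: counts peak in July and August (summer) and dip in November and February.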
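Both observations can be checked numerically. The sketch below rebuilds the same table, indexes it by month, and computes the total and the peak month for each year; `yearly_totals` and `peak_months` are illustrative names.

```python
import pandas as pd

data = {
    'Month': ['JAN','FEB','MAR','APR','MAY','JUN','JUL','AUG','SEP','OCT','NOV','DEC'],
    '1958': [340,318,362,348,363,435,491,505,404,359,310,337],
    '1959': [360,342,406,396,420,472,548,559,463,407,362,405],
    '1960': [417,391,419,461,472,535,622,606,508,461,390,432]
}
df = pd.DataFrame(data).set_index('Month')

# Trend: sum each year's column and compare the totals
yearly_totals = df.sum()

# Seasonality: the index label (month) of each year's maximum
peak_months = df.idxmax()

print(yearly_totals)
print(peak_months)
```

The totals increase each year, and the peak falls in August for 1958 and 1959 and in July for 1960.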

```python
# Total passengers per month across years
df['Total'] = df['1958'] + df['1959'] + df['1960']
max_month = df[df['Total'] == df['Total'].max()]
min_1958 = df[df['1958'] == df['1958'].min()]

print('Month with highest total passengers:\n', max_month)
print('Month with lowest passengers in 1958:\n', min_1958)
```

### 7. What does `.groupby()` do?
```python
# Group by example
dummy = pd.DataFrame({
    'Department': ['IT', 'HR', 'IT', 'HR'],
    'Salary': [70000, 50000, 80000, 55000]
})

grouped = dummy.groupby('Department').mean()
print(grouped)
```
**Explanation:** `.groupby()` splits the rows into groups that share a key value (here, `Department`), applies an aggregation such as mean, sum, or count to each group, and combines the results into one table with one row per group.
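Several aggregations can be computed in one pass by selecting the column and calling `.agg()` with a list of function names. This is a small sketch on the same dummy data; `summary` is an illustrative name.

```python
import pandas as pd

dummy = pd.DataFrame({
    'Department': ['IT', 'HR', 'IT', 'HR'],
    'Salary': [70000, 50000, 80000, 55000]
})

# Split by department, then compute three statistics per group
summary = dummy.groupby('Department')['Salary'].agg(['mean', 'sum', 'count'])
print(summary)
```

Each department becomes one row of `summary`, with one column per aggregation.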

```python
import seaborn as sns

# Load the Titanic dataset and count missing values per column
titanic = sns.load_dataset('titanic')
titanic.isnull().sum()
```

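```python
titanic['age'].hist(bins=30)
```

**Shape:** Right-skewed distribution

**Reason:** Many passengers are relatively young, and the counts taper off gradually toward older ages, which produces a long right tail. Rows with a missing age are left out of the histogram, so the shape reflects only passengers whose age was recorded.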
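A right skew can also be confirmed without a plot: the mean sits above the median and the skewness statistic is positive. The sketch below uses a small hypothetical sample of ages (not the Titanic data) so that it runs without downloading anything.

```python
import numpy as np
import pandas as pd

# Hypothetical ages for illustration only, with one missing value
ages = pd.Series([2, 18, 19, 21, 22, 24, 25, 28, 30, 35, 45, 60, 80, np.nan])

recorded = ages.count()     # counts non-missing values only
mean_age = ages.mean()      # NaN is skipped by default
median_age = ages.median()
skewness = ages.skew()      # positive for a right-skewed sample

print('Recorded ages:', recorded)
print('Mean above median:', mean_age > median_age)
print('Positive skewness:', skewness > 0)
```

The same three calls could be applied to `titanic['age']` to check the histogram's shape numerically.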

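```python
import numpy as np

low_std = [50, 51, 49, 52, 50]     # values clustered near the mean
high_std = [10, 100, 200, 5, 300]  # values spread far from the mean

# np.std computes the population standard deviation by default (ddof=0)
print('Low STD:', np.std(low_std))
print('High STD:', np.std(high_std))
```

**Explanation:** A high standard deviation means the values are spread far from the mean (more variation). A low standard deviation means the values are clustered close to the mean.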
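To see where the number comes from, the standard deviation can be computed by hand as the square root of the average squared distance from the mean. The sketch below does this for the low-spread list and compares it with NumPy; dividing by `n - 1` instead of `n` gives the sample version, which NumPy exposes through `ddof=1`.

```python
import math
import numpy as np

low_std = [50, 51, 49, 52, 50]

mean = sum(low_std) / len(low_std)
squared_devs = [(v - mean) ** 2 for v in low_std]

# Population standard deviation: divide by n
population_std = math.sqrt(sum(squared_devs) / len(low_std))

# Sample standard deviation: divide by n - 1
sample_std = math.sqrt(sum(squared_devs) / (len(low_std) - 1))

print('Manual population STD:', population_std)
print('NumPy  population STD:', np.std(low_std))
print('Manual sample STD:', sample_std)
print('NumPy  sample STD:', np.std(low_std, ddof=1))
```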

### 11. Real-world problems with missing data
1. **Medical Diagnosis**: Missing test results can mislead disease detection.
2. **Credit Scoring**: Incomplete financial data can cause incorrect credit decisions.
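As a rough illustration of the credit scoring case, the sketch below builds a tiny hypothetical table of applicants (all values invented) and shows two common responses to missing entries: dropping incomplete rows and filling gaps with the column median. Which one is appropriate depends on why the data is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical applicant records; NaN marks a missing value
applicants = pd.DataFrame({
    'income': [52000, np.nan, 61000, 48000],
    'debt': [12000, 8000, np.nan, 5000],
})

# Count the gaps before deciding how to handle them
missing_per_column = applicants.isnull().sum()

# Option 1: keep only fully observed rows (loses half the data here)
complete_rows = applicants.dropna()

# Option 2: fill each gap with that column's median
imputed = applicants.fillna(applicants.median())

print(missing_per_column)
print(complete_rows)
print(imputed)
```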