# Notes:

## df.age vs df['age'] vs df.loc['age']

### df['age']
- Returns the age column as a Series.
- Works even if column names have spaces or special characters.
- Always recommended, especially in production code.

### df.age
- Also returns the age column, just like df['age'].
- BUT: It only works if:
    - The column name is a valid Python variable name (e.g., no spaces, doesn’t start with numbers).
    - The column name doesn’t conflict with a DataFrame method (e.g., df.count → this gives the method .count() instead of a column).
- ✅ Good for quick interactive work (like in Jupyter).
- 🚫 Not recommended for critical code.

### df.loc['age']
- It accesses rows, not columns.
- It assumes 'age' is an index label.
- So unless 'age' is in the row index, you'll get a KeyError.
```python
df.set_index('name', inplace=True)
df.loc['Alice']  # works now, because 'Alice' is a label in the row index
```
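
The three access styles above can be compared side by side. This is a minimal sketch on a made-up frame; the `people` name, the values, and the `count` column are assumptions chosen only to show attribute access colliding with a DataFrame method.

```python
import pandas as pd

# Hypothetical data; the 'count' column deliberately shadows DataFrame.count
people = pd.DataFrame({
    'name': ['Alice', 'Mike'],
    'age': [20, 30],
    'count': [1, 2],
})

by_bracket = people['age']           # always the column
by_attr = people.age                 # same column, since 'age' is a valid identifier
is_method = callable(people.count)   # True: attribute access found the method, not the column

# .loc looks up row labels, so 'age' is not found in the default 0..n-1 index
try:
    people.loc['age']
    loc_failed = False
except KeyError:
    loc_failed = True

# After moving 'name' into the index, row labels can be looked up with .loc
alice = people.set_index('name').loc['Alice']
```

To reach the shadowed column, bracket access (`people['count']`) still works.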


## Axis

| Function  | `axis=0`                            | `axis=1`                         |
| --------- | ----------------------------------- | -------------------------------- |
| `concat`  | Stack rows **on top of each other** | Combine columns **side-by-side** |
| `drop`    | Drop rows by index                  | Drop columns by name             |
| `sum()`   | Sum **column-wise** (per column)    | Sum **row-wise** (per row)       |
| `mean()`  | Mean for each column                | Mean for each row                |
| `apply()` | Apply function to each column       | Apply function to each row       |
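
A small sketch of the table above on a made-up 2x2 frame (the `scores` name and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical 2x2 frame for illustration
scores = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

per_column = scores.sum(axis=0)   # one value per column: a=3, b=30
per_row = scores.sum(axis=1)      # one value per row: 11, 22

stacked = pd.concat([scores, scores], axis=0)       # 4 rows, 2 columns
side_by_side = pd.concat([scores, scores], axis=1)  # 2 rows, 4 columns

without_row = scores.drop(0, axis=0)     # drops the row labelled 0
without_col = scores.drop('a', axis=1)   # drops the column named 'a'
```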


In [1]:
import pandas as pd
from sklearn.datasets import fetch_california_housing

In [2]:
df = fetch_california_housing(as_frame=True).frame

In [3]:
## Data exploration functions 

# pd.options.display.max_columns = 5  # show at most 5 columns when displaying a DataFrame
df
df.info()  # show info about the dataset
# df.head()  # the first 5 rows by default; pass n to change the count
# df.tail()  # the last 5 rows by default; pass n to change the count
# df.sample()  # one random row by default; pass n=number to get that many random rows
# list(df.columns)  # get the dataset's column names


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   MedInc       20640 non-null  float64
 1   HouseAge     20640 non-null  float64
 2   AveRooms     20640 non-null  float64
 3   AveBedrms    20640 non-null  float64
 4   Population   20640 non-null  float64
 5   AveOccup     20640 non-null  float64
 6   Latitude     20640 non-null  float64
 7   Longitude    20640 non-null  float64
 8   MedHouseVal  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB


In [4]:
## Statistical functions & plotting
## Plotting means creating a visual representation of data, such as a chart or a graph

# df.describe()  # summary statistics (count, mean, std, min, quartiles, max) for each numeric column
# df['HouseAge']  # the 'HouseAge' column as a Series; type(df['HouseAge']) confirms it
# df['HouseAge'].mean()  # average: the sum of all values divided by the number of values
# df['HouseAge'].min()  # smallest value
# df['HouseAge'].max()  # largest value
# df['HouseAge'].std()  # standard deviation: how much the values spread around the mean
# df['HouseAge'].median()  # middle value of the sorted data; with an even number of items, the average of the middle two
# df['HouseAge'].mode()  # the most common value(s), returned as a Series
# df['HouseAge'].hist(figsize=(10, 7))  # histogram of this column, drawn with matplotlib behind the scenes
# df['HouseAge'].plot(figsize=(10, 7))  # line chart of this column, drawn with matplotlib behind the scenes
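
To make the definitions above concrete, here is a small sketch on a made-up Series (the `ages` values are an assumption, picked so the arithmetic is easy to check by hand):

```python
import pandas as pd

ages = pd.Series([10, 20, 20, 50])  # made-up values

mean = ages.mean()       # (10 + 20 + 20 + 50) / 4 = 25.0
median = ages.median()   # average of the two middle values: (20 + 20) / 2 = 20.0
modes = ages.mode()      # a Series, because several values can tie for most common

sample_std = ages.std()            # pandas divides by n - 1 by default (ddof=1)
population_std = ages.std(ddof=0)  # divide by n instead: sqrt(900 / 4) = 15.0
```

Note that `std()` gives the sample standard deviation unless `ddof=0` is passed, which matters when comparing against results from other tools.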

In [5]:
## Accessing data

df = pd.DataFrame({
    'name': ['Alice', 'Mike', 'Ramy'],
    'age': [20, 30, 45],
    'profession': ['Programmer', 'Clerk', 'Designer']
})
# df.loc[1]  # row with index label 1
# df.set_index('name', inplace=True)  # use the 'name' column as the row index
# df.loc['Alice']  # row labelled 'Alice'
# df.iloc[1]  # integer location: the row at position 1
# df.iloc[1, 0]  # the value at row position 1, column position 0
# df.loc['Alice', 'age']  # age of the row labelled 'Alice'
# df.at["Alice", "age"]  # like "loc", but only for a single value, not an entire row
# df.iat[1, 0]  # like "iloc", but only for a single value, not an entire row
# df.loc['Alice', 'age'] = 50  # change the age value; "at" works here too
# df.loc["Alice"] = [75, "Backend"]  # change the values of the entire row
# df.loc["John"] = [90, "Teacher"]  # add a new row
# df.iloc[0:2]  # slice of the df: rows at positions 0 and 1
# df.iloc[0:2, 1]  # column at position 1 ("profession" once 'name' is the index) from rows 0:2
# df.iloc[:, 1]  # column at position 1 ("profession" once 'name' is the index) from the entire df (:)
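
The label-based and position-based accessors can be checked against each other. A minimal sketch reusing the same three people (the `staff` name is an assumption, used so the notebook's `df` is left alone):

```python
import pandas as pd

staff = pd.DataFrame({
    'name': ['Alice', 'Mike', 'Ramy'],
    'age': [20, 30, 45],
    'profession': ['Programmer', 'Clerk', 'Designer'],
}).set_index('name')

row_by_label = staff.loc['Mike']           # Series holding Mike's row
row_by_position = staff.iloc[1]            # same row, selected by position
cell_by_label = staff.at['Mike', 'age']    # 30
cell_by_position = staff.iat[1, 0]         # 30: row 1, column 0 ('age')

# Slices differ: .loc includes the end label, .iloc excludes the end position
label_slice = staff.loc['Alice':'Mike']    # Alice and Mike
position_slice = staff.iloc[0:1]           # Alice only
```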

In [6]:
## Manipulation Data

# df.reset_index(drop=True, inplace=True)  # Reset index to default index (0 1 2 ..), "drop" parameter drops the current index instead of moving it into columns.
df
# df['age'] * 2
# df['age'] ** 2
# df['age'] = df['age'] * 2  

# def my_func(x):
#     if x % 2 == 0:
#         return x * 2
#     else:
#         return x
    
# df['age'] = df['age'].apply(my_func)  # apply function to age
# df['age'] = df['age'].apply(lambda x: x * 2 if x % 2 == 0 else x)  # apply lambda function

# df['summary'] = df.apply(lambda row: f"{row['name']} is a {row['profession']} and is {row['age']} years old", axis=1)  # axis=1 means "Apply the function row-wise instead of column-wise"

# df = df.apply(lambda row: row['name'], axis=1)  # row-wise
# df = df.apply(lambda col: col[0], axis=0)   # column wise

# df = df.drop('summary', axis=1) 
# df = df.drop(['age', 'profession'], axis=1)

# df = df.drop([0])           # Drops row at index 0, axis=0 "default"
# df = df.drop([0, 2])        # Drops rows at index 0 and 2, axis=0 "default"
# df = df.drop(index=[0, 2])  # Same, a bit clearer, axis=0 "default"

df.iat[1, 1] = float('nan')  # set the value at row 1, column 1 (Mike's age) to NaN
# df.dropna()  # drop rows that have NaN values
# df.fillna(0)  # replace NaN values with 0
# df.age = df.age.fillna(df.age.mean())  # replace NaN ages with the average age
df.notna()  # False for NaN values and True for all other values
df[df.age.notna()]  # show only the rows whose age is not NaN


,name,age,profession
0,Alice,20.0,Programmer
2,Ramy,45.0,Designer
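
The NaN helpers used in this cell can be compared directly. A small sketch on a made-up frame (the `staff` name is an assumption), which also shows that none of these calls change the original unless the result is assigned back:

```python
import pandas as pd

staff = pd.DataFrame({
    'name': ['Alice', 'Mike', 'Ramy'],
    'age': [20.0, float('nan'), 45.0],
})

dropped = staff.dropna()                                 # removes Mike's row
zero_filled = staff['age'].fillna(0)                     # NaN becomes 0
mean_filled = staff['age'].fillna(staff['age'].mean())   # mean skips NaN: (20 + 45) / 2 = 32.5
kept = staff[staff['age'].notna()]                       # same rows as dropna() here

# The original frame still has its NaN, because every call above returned a new object
still_missing = staff['age'].isna().sum()
```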


In [7]:
## Iterating over Data Frames

# df.set_index('name', inplace=True)
# df.reset_index(inplace=True)
# for i, row in df.iterrows():  # iterate over rows
#     print(row)

for col_name, col in df.items():  # iterate over columns: (column name, Series) pairs
    print(col[0])

Alice
20.0
Programmer
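
For comparison, a short sketch of what each iteration method yields, on a made-up two-row frame (the `staff` name and values are assumptions):

```python
import pandas as pd

staff = pd.DataFrame({
    'name': ['Alice', 'Mike'],
    'age': [20, 30],
})

# items() walks the columns: each step yields (column name, column as a Series)
column_names = [col_name for col_name, col in staff.items()]

# iterrows() walks the rows: each step yields (index label, row as a Series)
row_names = [row['name'] for idx, row in staff.iterrows()]

# itertuples() also walks the rows, yielding lightweight namedtuples
row_ages = [row.age for row in staff.itertuples(index=False)]
```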


In [8]:
## Filtering & Querying

# df['age'] > 20  # returns a boolean Series (True/False per row)
df[df['age'] > 20]  # returns the rows that satisfy the condition
df[~(df.age > 30) & (df.age < 90)]  # multiple conditions; ~ negates only the condition right after it, here (df.age > 30)

df['job'] = ['Programmer', 'Clerk', 'Designer']
df = df.drop('profession', axis=1)
df[(df.name.str.endswith('e')) & (df['age'].notna())]
df.age = df.age.fillna(30)

import datetime as dt 
df['birthday'] = df.age.apply(lambda x: dt.datetime.now() - dt.timedelta(365 * x))  # approximate birth date: now minus age * 365 days (leap years ignored)
df[df['birthday'].dt.year > 2000]

ages = [20, 30]
df[df.age.isin(ages)]

df.query('age > 30')  # concise, and can be faster on large frames, but not every filter can be written as a query string

,name,age,job,birthday
2,Ramy,45.0,Designer,1980-07-13 01:18:03.658274
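
Operator precedence is the easy thing to get wrong in these filters. A minimal sketch, assuming a made-up `people` frame, of how `~` applies to a single condition versus a whole expression, plus the same filter written as a mask, a query string, and an `isin` check:

```python
import pandas as pd

people = pd.DataFrame({'name': ['Alice', 'Mike', 'Ramy'], 'age': [20, 30, 45]})

# ~ binds tighter than &, so this means (NOT age > 30) AND (age < 25)
first_only = people[~(people.age > 30) & (people.age < 25)]

# Wrap the whole expression in parentheses to negate both conditions together
whole = people[~((people.age > 30) & (people.age < 25))]

# The same filter written three ways
mask_result = people[people['age'] > 20]
query_result = people.query('age > 20')
isin_result = people[people['age'].isin([30, 45])]
```

Here `first_only` keeps only Alice, while `whole` keeps all three rows, since no age is both above 30 and below 25.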


In [9]:
## Grouping data

# df.iat[1, 2] = "Programmer"
df.loc[4] = ['John', 40, "Programmer", None]
df
df.groupby('job').agg({
    'age': ['mean', 'min', 'max', 'sum']
})  # get the (average, min, max, sum) of "age" groupby job



,age,age,age,age
job,mean,min,max,sum
Clerk,30.0,30.0,30.0,30.0
Designer,45.0,45.0,45.0,45.0
Programmer,30.0,20.0,40.0,60.0
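
The dict form of `agg` produces two-level column headers, as in the output above. As a sketch of one alternative, named aggregation gives flat column names instead; the `staff` frame and the `mean_age` / `max_age` names below are assumptions for illustration:

```python
import pandas as pd

staff = pd.DataFrame({
    'name': ['Alice', 'Mike', 'Ramy', 'John'],
    'age': [20, 30, 45, 40],
    'job': ['Programmer', 'Clerk', 'Designer', 'Programmer'],
})

# Dict form: the result has two-level columns such as ('age', 'mean')
nested = staff.groupby('job').agg({'age': ['mean', 'max']})

# Named aggregation: each keyword becomes a flat column name
flat = staff.groupby('job').agg(mean_age=('age', 'mean'), max_age=('age', 'max'))
```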


In [10]:
## Sorting Values

df.sort_values('age')
df.sort_values('age', ascending=False)

,name,age,job,birthday
2,Ramy,45.0,Designer,1980-07-13 01:18:03.658274
4,John,40.0,Programmer,NaT
1,Mike,30.0,Clerk,1995-07-10 01:18:03.658272
0,Alice,20.0,Programmer,2005-07-07 01:18:03.658241
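
`sort_values` also accepts several columns, each with its own direction. A short sketch on a made-up frame (the `staff` name and values are assumptions):

```python
import pandas as pd

staff = pd.DataFrame({
    'name': ['Alice', 'Mike', 'Ramy', 'John'],
    'age': [20, 30, 45, 40],
    'job': ['Programmer', 'Clerk', 'Designer', 'Programmer'],
})

# Sort by job A-Z, then by age from oldest to youngest within each job
ranked = staff.sort_values(['job', 'age'], ascending=[True, False])

# sort_values returns a new frame; the original order is untouched
original_first = staff.iloc[0]['name']
```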


### Merging, Concatenating & Joining

| Function      | Main Use                                                     | Join Based On                    | Typical Use Case                        |
| ------------- | ------------------------------------------------------------ | -------------------------------- | --------------------------------------- |
| `pd.concat()` | Stack dataframes vertically (rows) or horizontally (columns) | By index or axis                 | Combine datasets without keys           |
| `pd.merge()`  | SQL-style joins                                              | On common columns or custom keys | Relational joins (e.g. by `id`, `user`) |
| `df.join()`   | Join based on index                                          | On index (can use columns too)   | Quickly join side data based on index   |

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'id': [1, 2], 'score': [90, 85]})

# merge on 'id'
pd.merge(df1, df2, on='id')

# join (after setting index)
df1.set_index('id').join(df2.set_index('id'))

# concat side by side (by index); both 'id' columns are kept
pd.concat([df1, df2], axis=1)
```



In [None]:

# Concatenating means combining entire dataframes together
# - Think of it as stacking things.
# - Can stack rows (axis=0) or columns (axis=1)
# - No matching keys: pieces are lined up by column names (axis=0) or by index labels (axis=1).
df1 = pd.DataFrame({
    'Item': ['A', 'B', 'C'],
    'Price': [10, 20, 30]
})
df2 = pd.DataFrame({
    'Item': ['D', 'E', 'F'],
    'Price': [40, 50, 60]
})

pd.concat([df1, df2], ignore_index=True)  # combine the 2 dataframes into one with a fresh index from 0 to 5 and 2 columns (Item, Price)

df1 = pd.DataFrame({
    'Item': ['A', 'B', 'C'],
    'Price': [10, 20, 30]
})
df2 = pd.DataFrame({
    'Country': ['X', 'Y', 'Z'],
    'Available': [True, True, False]
})

pd.concat([df1, df2], axis=1)  # axis=1 means concatenating columns side-by-side, horizontally not vertically


# Merging is combining rows from df1 and df3 based on matching values in a key column
# - Like SQL JOIN
# - Use when two DataFrames share a common column or key
# - Can specify how='left' | 'right' | 'outer' | 'inner'
df3 = pd.DataFrame({
    'Item': ['B', 'C', 'D'],
    'Country': ['X', 'Y', 'Z']
})

pd.merge(df1, df3, how='inner')  # keeps only the "Item" values found in both (B, C) and drops the rest (A, D); 'inner' is the default, so how= can be omitted
pd.merge(df1, df3, how='outer')  # keeps all rows from both df1 and df3; where there is no match, fills with NaN
pd.merge(df1, df3, how='left')  # keeps all rows from df1 and adds matching data from df3; no match gives NaN in the df3 columns
pd.merge(df1, df3, how='right')  # keeps all rows from df3 and adds matching data from df1; no match gives NaN in the df1 columns

pd.merge(df1, df3, on='Item', how='right')  # "on" specifies the column to merge on


# Joining combines dataframes on their index rather than on columns
# - Like .merge(), but simpler, and it uses the index by default
# - Usually: df1.join(df2) joins df2 to df1 by index
df4 = pd.DataFrame({
    'Price': [10, 20, 30]
}, index=['A', 'B', 'C'])
df5 = pd.DataFrame({
    'Country': ['X', 'Y', 'Z']
},  index=['B', 'C', 'D'])

df4.join(df5)  # joins on index; "how" defaults to "left", so all of df4's index is kept and rows missing in df5 get NaN
df4.join(df5, how='inner')  # joins on index; inner keeps only the index labels present in both df4 and df5
df4.join(df5, how='outer')  # joins on index; outer keeps all index labels from both
df4.join(df5, how='right')  # joins on index; right keeps all of df5's index, and rows missing in df4 get NaN



,Price,Country
B,20.0,X
C,30.0,Y
D,,Z
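
To see the four `how` options next to each other, this sketch collects which `Item` values survive each merge, using the same data as the cell above under new names (`prices` and `origins` are assumptions, chosen to avoid overwriting `df1` and `df3`):

```python
import pandas as pd

prices = pd.DataFrame({'Item': ['A', 'B', 'C'], 'Price': [10, 20, 30]})
origins = pd.DataFrame({'Item': ['B', 'C', 'D'], 'Country': ['X', 'Y', 'Z']})

# Which 'Item' values survive under each join type
kept = {
    how: pd.merge(prices, origins, on='Item', how=how)['Item'].tolist()
    for how in ['inner', 'left', 'right', 'outer']
}

# The index-based equivalent of how='right': keeps B, C, D, with NaN Price for D
right_join = prices.set_index('Item').join(origins.set_index('Item'), how='right')
```

`kept['inner']` holds only B and C, `kept['left']` adds A, `kept['right']` adds D, and `kept['outer']` contains all four items.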
