## Pandas  
***Pandas is a library for working with data. Its two best-known structures are:***
- **`Series` and `DataFrame`**

It can serve as a substitute for Excel for many tasks.

### Series
A `Series` is a data structure very similar to a one-dimensional array, with a label (index) attached to each element.
**Format**: 
`Series([data, index, dtype, name, copy, ...])`

#### Syntax 
```python
import pandas as pd
a = [i for i in range(10)]
a_series = pd.Series(a)
```
#### Some series attributes
 Attribute | Return | Description
 --------- | ------ | -----------
 Series.index | Index (e.g. RangeIndex) | Returns the index (labels) of the series
 Series.dtype | dtype (e.g. `dtype('int64')`) | Returns the data type of the elements
 Series.size | int | Returns the number of elements
 Series.name | str | Returns the name of the series

**Note**: Many attributes are common  between `Series` and `DataFrames`. 

##### Example 
```python 
import pandas as pd 
numbers = [i + 10 for i in range(10)]
index = list('abcdefghij')
numbers_s = pd.Series(numbers, index=index, name='numbers')
```
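The attributes from the table above can be read directly from a series. The following is a minimal sketch that rebuilds a series like `numbers_s` and inspects it; the variable names `first_by_label` and `first_by_position` are illustrative assumptions.

```python
import pandas as pd

# Ten integers (10..19) labelled with the letters a-j
numbers_s = pd.Series(range(10, 20), index=list('abcdefghij'), name='numbers')

print(numbers_s.index)  # the labels of the series
print(numbers_s.dtype)  # the data type of the elements
print(numbers_s.size)   # the number of elements: 10
print(numbers_s.name)   # the name of the series: numbers

# An element can be read by its label or by its integer position
first_by_label = numbers_s['a']
first_by_position = numbers_s.iloc[0]
```

Both lookups return the same element here, because the label `'a'` sits at position 0.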
### DataFrame
***DataFrames are data structures in table format with many functionalities similar to Excel.***
#### Format 
`DataFrame([data, index, columns, dtype, copy])`
#### Syntax
```python
import pandas as pd
a = [i for i in range (10)]
b = [i for i in range (10)]
data = {'Col A' : a ,
        'Col B' : b
        }
data_df = pd.DataFrame(data)
```
#### Some DataFrame attributes

 Attribute | Return | Description
 --------- | ------ | -----------
 DataFrame.index | Index (e.g. RangeIndex) | returns the row labels of the DataFrame
 DataFrame.columns | Index | returns the column names
 DataFrame.dtypes | Series | returns a Series with the data type of each column
 DataFrame.values | ndarray | returns a NumPy array with the values
 DataFrame.size | int | returns the number of values in the DataFrame
 DataFrame.shape | tuple | returns a tuple with the number of rows and columns in the DataFrame

##### Column Selection: 
- `DataFrame[ColumnName]` returns a Series with the selected column.
- `DataFrame[[ColumnNameA, ColumnNameB, ...]]` returns a DataFrame with the selected columns.
- `DataFrame[newName] = newData` adds `newData` as a new column named `newName` (or overwrites that column if it already exists).

##### Examples
```python
import pandas as pd
import numpy as np
a = list('abcdefghijkl')
b = np.linspace(0,20,12)
data = {'Char': a,
        'numbers': b}

data_df = pd.DataFrame(data)
c = pd.Series(range(12), name = 'Series Example b')
data_df[c.name] = c
```
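To connect the attribute table with the column selection rules, here is a small sketch using the same kind of data as the example above. The column name `doubled` is a hypothetical addition made only for illustration.

```python
import numpy as np
import pandas as pd

data_df = pd.DataFrame({'Char': list('abcdefghijkl'),
                        'numbers': np.linspace(0, 20, 12)})

print(data_df.shape)    # (12, 2): 12 rows and 2 columns
print(data_df.size)     # 24: rows * columns
print(data_df.columns)  # the column names
print(data_df.dtypes)   # one data type per column

one_column = data_df['numbers']              # single brackets: a Series
two_columns = data_df[['Char', 'numbers']]   # double brackets: a DataFrame
data_df['doubled'] = data_df['numbers'] * 2  # assignment adds a new column
```

Note the difference between single and double brackets: the first returns a `Series`, the second a `DataFrame`, even when the list holds only one column name.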

#### Basic Methods for DataFrame
 Method | Description
 ------ | -----------
 DataFrame.head([n]) | returns the first n rows of the DataFrame. Default is 5
 DataFrame.tail([n]) | returns the last n rows of the DataFrame. Default is 5
 DataFrame.min([axis]) | returns the minimum of each column (`axis = 0`, the default) or of each row (`axis = 1`)
 DataFrame.max([axis]) | returns the maximum of each column (`axis = 0`, the default) or of each row (`axis = 1`)
 DataFrame.cumsum([axis]) | returns the cumulative sum down each column (`axis = 0`, the default) or across each row (`axis = 1`)
 DataFrame.value_counts() | returns a Series with the number of times each distinct row appears in the DataFrame
 DataFrame.sort_values(by) | sorts the DataFrame according to the argument `by= 'ColumnName'`

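The methods in the table above can be tried on a tiny DataFrame. This is a minimal sketch with hypothetical data, chosen so that the results are easy to check by hand.

```python
import pandas as pd

# Hypothetical data: two numeric columns and one text column
df = pd.DataFrame({'A': [3, 1, 2],
                   'B': [10, 30, 20],
                   'C': ['x', 'y', 'x']})

first_two = df.head(2)                # the first 2 rows
col_min = df[['A', 'B']].min()        # axis=0: one minimum per column
row_max = df[['A', 'B']].max(axis=1)  # axis=1: one maximum per row
running = df[['A', 'B']].cumsum()     # running total down each column
by_a = df.sort_values(by='A')         # rows reordered by the values in A
counts = df['C'].value_counts()       # how often each value appears in C

print(col_min)
print(row_max)
print(counts)
```

The numeric columns are selected before calling `min`, `max` and `cumsum` so that the text column does not take part in the calculation.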
### Reading and writing files
**The function for reading CSV file**:
`pandas.read_csv(FilePath, sep = ',', encoding = None, ...)` (the signature lists the `sep` default as `<no_default>`, which resolves to a comma)

**Methods for writing CSV and Excel files**:
`DataFrame.to_excel(path, sheet_name = 'Sheet1', ...)`
`DataFrame.to_csv(path, sep = ',', ...)`

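A quick way to check the reading and writing functions is a round trip: write a DataFrame to a CSV file and read it back. The sketch below assumes a temporary directory and a hypothetical file name `example.csv`; the semicolon separator is just an illustration of the `sep` argument.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Alice'], 'Salary': [50000, 60000]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'example.csv')  # hypothetical file name
    # index=False keeps the row labels out of the file
    df.to_csv(csv_path, sep=';', index=False)
    # The same separator must be given when reading the file back
    df_back = pd.read_csv(csv_path, sep=';')

print(df_back)
```

`to_excel` works in a similar way, but it needs an additional Excel engine to be installed, so it is left out of this sketch.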
### Selecting rows and columns with loc and iloc
**`loc` and `iloc` are DataFrame properties that let us access rows and columns in a way similar to slicing lists.**
#### Properties of `loc` and `iloc`

 Property | Description
 -------- | -----------
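 loc[i,j] | where i and j are the labels (names) of the selected rows and columns; slices include the end label
 iloc[i,j] | where i and j are the integer positions of the selected rows and columns; slices exclude the end position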

##### Example
```python
import pandas as pd
import numpy as np
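File = r'...\path'  # placeholder path to the CSV file
df = pd.read_csv(File)
df.index = np.arange(len(df)) + 10.1  # float labels: 10.1, 11.1, 12.1, ...
df.head(10)
col = ['Energy Source', 'Location']
df.loc[10.1 : 14.1, col]  # rows by label (end label included), columns by name
df.iloc[0 : 4, 0 : 5]  # rows and columns by integer position (end excluded)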
```
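Because the example above depends on an external file, here is a self-contained sketch of the same idea. The row labels `p1` to `p4` and all the values are hypothetical; only the column names come from the example.

```python
import pandas as pd

# Hypothetical data with text labels as the index
df = pd.DataFrame({'Energy Source': ['Solar', 'Wind', 'Coal', 'Hydro'],
                   'Location': ['North', 'South', 'East', 'West'],
                   'Capacity (MW)': [120, 300, 900, 450]},
                  index=['p1', 'p2', 'p3', 'p4'])

# loc uses labels, and the end label 'p3' is included
by_label = df.loc['p1':'p3', ['Energy Source', 'Location']]

# iloc uses integer positions, and the end position 3 is excluded
by_position = df.iloc[0:3, 0:2]

# A single cell, addressed both ways
cell_by_label = df.loc['p2', 'Capacity (MW)']
cell_by_position = df.iloc[1, 2]
```

Both slices select the same three rows and two columns, which shows how the inclusive label slice of `loc` lines up with the exclusive position slice of `iloc`.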
### Filters 
**DataFrame filters can be created from boolean series**
#### Syntax
`colfilter = DataFrame['Column_to_filter'] =='Filter_Value' `  
`DataFrame[colfilter]`
##### Example demonstration 
```python
import pandas as pd
import numpy as np
file = 'path'
df = pd.read_csv(file)
colfilter = df['Capacity (MW)'] > 6000
df[colfilter]
renewables_filter = df['Renewables'] == 'Yes'
df_renewables_filter = df[renewables_filter]
```
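Boolean series can also be combined before filtering. The sketch below uses hypothetical values with the column names from the example; `&` means "and", `~` means "not", and `isin` checks membership in a list.

```python
import pandas as pd

# Hypothetical data using the column names from the example above
df = pd.DataFrame({'Energy Source': ['Solar', 'Wind', 'Coal', 'Hydro'],
                   'Renewables': ['Yes', 'Yes', 'No', 'Yes'],
                   'Capacity (MW)': [120, 300, 900, 450]})

is_renewable = df['Renewables'] == 'Yes'
is_large = df['Capacity (MW)'] > 200

large_renewables = df[is_renewable & is_large]  # both conditions hold
not_renewable = df[~is_renewable]               # the condition is negated
solar_or_wind = df[df['Energy Source'].isin(['Solar', 'Wind'])]

print(large_renewables)
```

When the comparisons are written inline, each one needs its own parentheses, for example `df[(df['A'] > 1) & (df['B'] < 5)]`, because `&` and `|` bind more tightly than `>` and `==`.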

### Data Cleaning and preprocessing 
**Often times, before we start our analyses we need to clean and treat the data in a DataFrame.**
#### Methods for DataFrame processing

 Method | Description
 ------ | -----------
 DataFrame.drop(labels=None, axis=0, ...) | Removes the rows (`axis = 0`) or columns (`axis = 1`) with the given labels
 DataFrame.isnull() / DataFrame.notnull() | Returns a boolean DataFrame marking the null (or non-null) cells
 DataFrame.dropna() | Deletes rows or columns with null cells
 DataFrame.fillna() | Replaces null values with a given value
 DataFrame.duplicated() | Returns a boolean Series marking duplicated rows
 DataFrame.drop_duplicates() | Deletes duplicated rows; you can restrict the comparison to certain columns using `subset`

**Note**: By default these methods return a new object and do not modify the original DataFrame, so it is useful to store the result in a new variable.

##### Example
```python
import pandas as pd
file = 'path'
df = pd.read_csv(file)
df.head(10).drop(['Yield (%)','Reaction Time (min)'], axis=1)  # axis=1 drops columns
df.head(10).notnull()
df.head(10).fillna(999)
```
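The cleaning methods can be compared side by side on a small DataFrame. This sketch uses hypothetical values with the column names from the example above; it contains one missing value and one duplicated row.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one null cell (row 1) and one repeated row (row 3)
df = pd.DataFrame({'Yield (%)': [90.0, np.nan, 75.0, 75.0],
                   'Reaction Time (min)': [30, 45, 60, 60]})

null_mask = df.isnull()         # True where a cell is null
no_nulls = df.dropna()          # rows with a null cell removed
filled = df.fillna(0)           # null cells replaced with 0
dup_mask = df.duplicated()      # True for rows that repeat an earlier row
no_dups = df.drop_duplicates()  # repeated rows removed

print(len(df), len(no_nulls), len(no_dups))  # 4 3 3
```

Each call returns a new object, so `df` still has its four original rows at the end.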

### Joining tables
***Join Method***: This method joins columns from DataFrames **on their indices**. It is important to use it carefully, as it is a powerful tool for merging information from databases.

#### Format
`DataFrame.join(other, on = None, how = 'left', lsuffix = '', rsuffix = '', sort = False)`

*Observations* :
- `other` : Another object (DataFrame, Series or a list of DataFrames)
- `how` : `left` = index of the calling DataFrame, `right` = index of *other*, `outer` = union of both indices, `inner` = intersection of both indices
- `lsuffix`, `rsuffix` : suffixes appended to the left and right column names when both DataFrames have columns with the same name.

##### Example 
```python
import pandas as pd
df1 = pd.DataFrame({'Employee_id': [1, 2, 3, 4],
                    'Employee_Name': ['Alice', 'Bob', 'Charlie', 'David']})
df2 = pd.DataFrame({'Employee_id': [3, 4, 5, 6],
                    'Department': ['HR', 'IT', 'Finance', 'Marketing'],
                    'Employee_Name': ['Charlie', 'David', 'Rafael', 'Lucas']})
df1.set_index('Employee_id', inplace=True)
df2.set_index('Employee_id', inplace=True)  # Setting the Employee id as the joining index
# Both tables have an Employee_Name column, so the right one gets the suffix 'b'
df3 = df1.join(df2, how='outer', rsuffix='b')
```
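To see what each value of `how` does, the following sketch joins two small tables four times and collects the resulting indices. The tables are simplified versions of the ones above (no overlapping column names), and `index_by_how` is an illustrative name.

```python
import pandas as pd

df1 = pd.DataFrame({'Employee_Name': ['Alice', 'Bob', 'Charlie', 'David']},
                   index=pd.Index([1, 2, 3, 4], name='Employee_id'))
df2 = pd.DataFrame({'Department': ['HR', 'IT', 'Finance', 'Marketing']},
                   index=pd.Index([3, 4, 5, 6], name='Employee_id'))

# Collect the index that results from each kind of join
index_by_how = {}
for how in ['left', 'right', 'inner', 'outer']:
    joined = df1.join(df2, how=how)
    index_by_how[how] = list(joined.index)
    print(how, index_by_how[how])
```

Ids 3 and 4 are the only ones present in both tables, so they are the only rows with both a name and a department; in the other joins the missing cells are filled with `NaN`.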

### Concat Function
The concat function is used to concatenate DataFrames, whether it's adding them in terms of rows (`axis = 0`) or columns (`axis = 1`).
#### Format
`pd.concat(objs, axis = 0, join = 'outer', ignore_index = False, keys = None, levels = None, names = None, verify_integrity = False, sort = False, ...)`

**Observations** : 
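- `objs`: a single argument, so several DataFrames are passed together as a list.
- `ignore_index`: when `True`, discards the original labels along the concatenation axis and renumbers them 0, 1, 2, ...
- `join`: `outer` = union, `inner` = intersection of the labels on the other axis.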

##### Example
```python
import pandas as pd
df1 = pd.DataFrame({'Employee_id': [1, 2, 3, 4],
                    'Employee_Name': ['Alice', 'Bob', 'Charlie', 'David']})
df2 = pd.DataFrame({'Employee_id': [3, 4, 5, 6],
                    'Department': ['HR', 'IT', 'Finance', 'Marketing'],
                    'Employee_Name': ['Charlie', 'David', 'Rafael', 'Lucas']})
df3 = pd.DataFrame({'Employee_id': [2, 3, 4, 7],
                    'Salary': [70000, 80000, 90000, 100000]})
df1.set_index('Employee_id', inplace=True)
df2.set_index('Employee_id', inplace=True)  # Setting the Employee id as the joining index
df3.set_index('Employee_id', inplace=True)
frames = [df1, df2, df3]
# join='inner' keeps only the ids present in all three frames (3 and 4)
dfs_concated = pd.concat(frames, axis=1, join='inner', ignore_index=False)
```
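The example above concatenates along the columns. As a complement, this sketch stacks rows (`axis = 0`) and shows the effect of `ignore_index` and `keys`. The monthly tables `jan` and `feb` are hypothetical.

```python
import pandas as pd

# Hypothetical monthly tables with the same columns
jan = pd.DataFrame({'Name': ['John', 'Alice'], 'Salary': [5000, 6000]})
feb = pd.DataFrame({'Name': ['John', 'Alice'], 'Salary': [5200, 6100]})

# Original labels are kept, so the index repeats: 0, 1, 0, 1
stacked = pd.concat([jan, feb], axis=0)

# ignore_index=True renumbers the rows: 0, 1, 2, 3
renumbered = pd.concat([jan, feb], axis=0, ignore_index=True)

# keys adds an outer index level that records the source of each row
labelled = pd.concat([jan, feb], keys=['January', 'February'])

print(renumbered)
```

Using `keys` is handy when the origin of each row matters later, since `labelled.loc['February']` recovers the second table.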
### Pivot Tables
A pivot table is a tool that allows us to make different groupings of our information.

#### Syntax
`pd.pivot_table(data, values = None, index = None, columns = None, aggfunc = 'mean', fill_value = None, margins = False, dropna = True, margins_name= 'All', observed = False, sort = True)`

**Notes** : 
`aggfunc` can be, among others, `'mean'`, `'sum'`, `'min'` or `'max'`.

##### Usage Example 
`pd.pivot_table(df, index = 'Energy Source', aggfunc = 'mean', values = 'Efficiency (%)', columns = 'Location')`

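The usage example can be tried on a small DataFrame. This is a minimal sketch with hypothetical values; the column names match the call above.

```python
import pandas as pd

# Hypothetical data: Wind has two entries in the North and none in the South
df = pd.DataFrame({'Energy Source': ['Solar', 'Solar', 'Wind', 'Wind'],
                   'Location': ['North', 'South', 'North', 'North'],
                   'Efficiency (%)': [20.0, 22.0, 40.0, 44.0]})

# Rows: one per energy source; columns: one per location; cells: mean efficiency
table = pd.pivot_table(df, index='Energy Source', columns='Location',
                       values='Efficiency (%)', aggfunc='mean')

print(table)
```

The Wind/North cell holds the mean of 40.0 and 44.0, which is 42.0. The Wind/South cell has no data and is shown as `NaN`; passing `fill_value` replaces such cells with a chosen value.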





## Practice Exercises
### 1. ChatGPT set of exercises 
- Create a DataFrame
- Display the first 2 rows.
- Show all column names and data types.
- Select only the `Name` and `Salary` columns
#### Indexing & Filtering
**Task**:
- Select all people older than 30.
- Show rows where Salary > 30000 and City is 'London' or 'New York'.

#### Sorting & Aggregation
- Sort the DataFrame by Salary (descending).
- Find the average salary.
- Get the maximum age.

```python
import pandas as pd 
# creating a DataFrame
df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Emma'],
                    'Age': [28, 24, 35, 30],
                    'City': ['Paris', 'London', 'New York', 'Berlin'],
                    'Salary': [50000, 60000, 55000, 70000]})
print('The DataFrame is:\n', df)
# Display the first 2 rows
print('\nThe first 2 rows are:\n', df.head(2))
# Show all column names and data types
print('\nColumn names and data types:\n', df.dtypes)
# Select only the `Name` and `Salary` columns
print('\nName and Salary columns:\n', df[['Name', 'Salary']])

# Filter rows where Age is greater than 30
filtered_df = df[df['Age'] > 30]
print('\nRows where Age > 30:\n', filtered_df)

# showing rows where salary > 30000 and city is 'London' or 'New York'
filtered_df2 = df[(df['Salary'] > 30000) & ((df['City'] == 'London') | (df['City'] == 'New York'))]
print('\nRows where Salary > 30000 and City is London or New York:\n', filtered_df2)

# sorting the DataFrame by salary, descending
sorted_df = df.sort_values(by='Salary', ascending=False)
print('\nDataFrame sorted by Salary (descending):\n', sorted_df)

# finding the average salary
average_salary = df['Salary'].mean()
print('\nAverage Salary:', average_salary)

# finding the maximum age
max_age = df['Age'].max()
print('\nMaximum Age:', max_age)
```



Output:
```
The DataFrame is:
     Name  Age      City  Salary
0   John   28     Paris   50000
1  Alice   24    London   60000
2    Bob   35  New York   55000
3   Emma   30    Berlin   70000

The first 2 rows are:
     Name  Age    City  Salary
0   John   28   Paris   50000
1  Alice   24  London   60000

Column names and data types:
 Name      object
Age        int64
City      object
Salary     int64
dtype: object

Name and Salary columns:
     Name  Salary
0   John   50000
1  Alice   60000
2    Bob   55000
3   Emma   70000

Rows where Age > 30:
   Name  Age      City  Salary
2  Bob   35  New York   55000

Rows where Salary > 30000 and City is London or New York:
     Name  Age      City  Salary
1  Alice   24    London   60000
2    Bob   35  New York   55000

DataFrame sorted by Salary (descending):
     Name  Age      City  Salary
3   Emma   30    Berlin   70000
1  Alice   24    London   60000
2    Bob   35  New York   55000
0   John   28     Paris   50000

Average Salary: 58750.0

Maximum Age: 35
```

#### Intermediate Exercises
**GroupBy**
Suppose you extend your dataset with multiple entries for employees (like repeated names with different months of salary).
- Group by Name and calculate the mean salary.
- Group by City and count how many employees are in each city.

```python
import pandas as pd
# Extending the dataframe with multiple entries for each person (different months of salary)
data = {
    'Name': ['John', 'John', 'Alice', 'Alice', 'Bob', 'Bob', 'Emma', 'Emma'],
    'Month': ['January', 'February', 'January', 'February', 'January', 'February', 'January', 'February'],
    'Salary': [5000, 5200, 6000, 6100, 5500, 5600, 7000, 7200],
    'City': ['Paris', 'Paris', 'London', 'London', 'New York', 'New York', 'Berlin', 'Berlin']
}
df_extended = pd.DataFrame(data)
print('\nExtended DataFrame:\n', df_extended)
# Group by `Name` and calculate the mean salary for each person
mean_salary_df = df_extended.groupby('Name')['Salary'].mean().reset_index()
print('\nMean Salary for each person:\n', mean_salary_df)

# Group by `City` and count the distinct employees in each city
employee_count_df = df_extended.groupby('City')['Name'].nunique().reset_index(name='Employee Count')
print('\nEmployee count for each city:\n', employee_count_df)
```




Output:
```
Extended DataFrame:
     Name     Month  Salary      City
0   John   January    5000     Paris
1   John  February    5200     Paris
2  Alice   January    6000    London
3  Alice  February    6100    London
4    Bob   January    5500  New York
5    Bob  February    5600  New York
6   Emma   January    7000    Berlin
7   Emma  February    7200    Berlin

Mean Salary for each person:
     Name  Salary
0  Alice  6050.0
1    Bob  5550.0
2   Emma  7100.0
3   John  5100.0

Employee count for each city:
        City  Employee Count
0    Berlin               1
1    London               1
2  New York               1
3     Paris               1
```


```python
import pandas as pd
import numpy as np
# creating a dataframe
df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Emma'],
                    'Age': [28, 24, 35, 30],
                    'City': ['Paris', 'London', 'New York', 'Berlin'],
                    'Salary': [50000, 60000, 55000, 70000]})
print('The DataFrame is:\n', df)
# Extending the dataset with different entries
df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Emma', 'Liam', 'Noah'],
                    'Age': [28, 24, 35, 30, None, 30],
                    'City': ['Paris', 'London', 'New York', 'Berlin', 'Paris', None],
                    'Salary': [50000, 60000, 55000, 70000, 65000, None]})
print('\nExtended DataFrame with missing values:\n', df)
# filling missing age with the mean age
df.fillna({'Age': df['Age'].mean()}, inplace=True)
print('\nDataFrame after filling missing Age with mean:\n', df)
# drop rows where salary is missing
df.dropna(subset=['Salary'], inplace=True)
print('\nDataFrame after dropping rows with missing Salary:\n', df)
```


Output:
```
The DataFrame is:
     Name  Age      City  Salary
0   John   28     Paris   50000
1  Alice   24    London   60000
2    Bob   35  New York   55000
3   Emma   30    Berlin   70000

Extended DataFrame with missing values:
     Name   Age      City   Salary
0   John  28.0     Paris  50000.0
1  Alice  24.0    London  60000.0
2    Bob  35.0  New York  55000.0
3   Emma  30.0    Berlin  70000.0
4   Liam   NaN     Paris  65000.0
5   Noah  30.0      None      NaN

DataFrame after filling missing Age with mean:
     Name   Age      City   Salary
0   John  28.0     Paris  50000.0
1  Alice  24.0    London  60000.0
2    Bob  35.0  New York  55000.0
3   Emma  30.0    Berlin  70000.0
4   Liam  29.4     Paris  65000.0
5   Noah  30.0      None      NaN

DataFrame after dropping rows with missing Salary:
     Name   Age      City   Salary
0   John  28.0     Paris  50000.0
1  Alice  24.0    London  60000.0
2    Bob  35.0  New York  55000.0
3   Emma  30.0    Berlin  70000.0
4   Liam  29.4     Paris  65000.0
```