## What is Pandas?

Pandas is a popular open-source data manipulation and analysis library for Python. It provides data structures for efficiently storing and manipulating large datasets, along with functions for reading and writing data in different file formats. The primary data structures in pandas are:

1. **Series:** One-dimensional labeled array capable of holding any data type. It is similar to a **column** in a spreadsheet or a single column in a DataFrame.

2. **DataFrame:** A two-dimensional table with labeled axes (rows and columns). It is the primary data structure used in pandas and can be thought of as a container for Series objects.
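
To illustrate how the two structures relate, the short sketch below builds a small DataFrame and pulls out one column. The column names and values are made up for illustration.

```python
import pandas as pd

# A small DataFrame with two columns (names and values are illustrative)
df = pd.DataFrame({"height_cm": [170, 165, 180], "weight_kg": [65.0, 58.5, 82.0]})

# Selecting one column returns a Series that shares the DataFrame's row index
heights = df["height_cm"]

print(type(heights).__name__)          # Series
print(heights.index.equals(df.index))  # True
```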

Some key features and functionalities of pandas include:

- **Integration with NumPy:** Pandas is built on top of the NumPy library, which provides high-performance numerical operations. This integration allows for seamless interaction between NumPy and pandas.

- **Data I/O:** Pandas supports various file formats, including CSV, Excel, SQL databases, and more, making it easy to import and export data.

- **Data Exploration:** It allows for easy data exploration and manipulation, such as filtering, grouping, and aggregating data.

- **Data Cleaning:** Pandas provides functions to handle missing data, duplicate values, and other common data cleaning tasks.

- **Time Series Data:** It has robust support for working with time series data, making it suitable for analyzing temporal data.
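
As a rough illustration of two of the features listed above (data cleaning and NumPy integration), here is a minimal sketch. The `sensor` and `reading` columns are hypothetical, and filling missing values with `0.0` is only one of several possible strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with one duplicated row and one missing value
raw = pd.DataFrame({
    "sensor": ["s1", "s2", "s2", "s3"],
    "reading": [4.0, 9.0, 9.0, np.nan],
})

deduped = raw.drop_duplicates()            # drops the repeated ("s2", 9.0) row
filled = deduped.fillna({"reading": 0.0})  # replaces the missing reading with 0.0

# NumPy functions accept pandas objects and return a labeled Series
roots = np.sqrt(filled["reading"])
print(roots.tolist())  # [2.0, 3.0, 0.0]
```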



## Series

A Series is a one-dimensional array-like object that holds a sequence of values together with associated labels, called the **index**. All items in a Series share a single data type (dtype), much like the **homogeneous** elements of a NumPy array. Here are several ways to create a Series in pandas:

1. **From a List:**
   You can create a Series from a Python list.

    ```python
    import pandas as pd
    data_list = [1, 2, 3, 4, 5]
    series_from_list = pd.Series(data_list)
    ```

2. **From a NumPy Array:**
   Pandas Series can be created from a NumPy array.

    ```python
    import pandas as pd
    import numpy as np
    data_array = np.array([1, 2, 3, 4, 5])
    series_from_array = pd.Series(data_array)
    ```

3. **From a Dictionary:**
   Keys of the dictionary become the index of the Series, and values become the data.

    ```python
    import pandas as pd
    data_dict = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
    series_from_dict = pd.Series(data_dict)
    ```
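
The single-dtype behavior described above can be checked directly. In this small sketch (the variable names are illustrative), pandas picks one dtype that can represent every value in the input.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])          # all integers
mixed_num = pd.Series([1, 2.5, 3])   # integers are upcast to floats
mixed_obj = pd.Series([1, "two"])    # falls back to the generic object dtype

print(ints.dtype)       # int64
print(mixed_num.dtype)  # float64
print(mixed_obj.dtype)  # object
```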

**Specifying Index:**

When you create a pandas Series from a list-like object, by default, it is assigned a numerical index. However, if you wish to customize the index labels for better identification, you can do so by using the `index` parameter. This parameter allows you to explicitly specify the index labels you want associated with each element in the Series.

For instance, consider the following code:

```python
import pandas as pd

# Sample data
data = [10, 20, 30, 40, 50]

# Default Series with numerical index
default_series = pd.Series(data)

# Creating a Series with a custom index
custom_index = ['a', 'b', 'c', 'd', 'e']
series_with_index = pd.Series(data, index=custom_index)
```

In the `default_series`, the default numerical index will be assigned. However, in `series_with_index`, we use the `index` parameter to specify a custom index, resulting in a Series where each element is associated with a label ('a', 'b', 'c', 'd', 'e') for easier reference and interpretation.
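
To see the difference between the two, you can inspect the `index` attribute of each Series. The sketch below also shows that the number of labels has to match the number of values; the flag name `length_mismatch_rejected` is just an illustrative helper.

```python
import pandas as pd

data = [10, 20, 30, 40, 50]
default_series = pd.Series(data)
series_with_index = pd.Series(data, index=["a", "b", "c", "d", "e"])

print(default_series.index)           # RangeIndex(start=0, stop=5, step=1)
print(list(series_with_index.index))  # ['a', 'b', 'c', 'd', 'e']

# Passing two labels for five values is rejected with a ValueError
try:
    pd.Series(data, index=["a", "b"])
    length_mismatch_rejected = False
except ValueError:
    length_mismatch_rejected = True
print(length_mismatch_rejected)  # True
```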

In [7]:
import pandas as pd
data_list = [1, 2, 3, 4, 5]
series_from_list = pd.Series(data_list)

data_dict = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
series_from_dict = pd.Series(data_dict)

print(series_from_list)
print(series_from_dict)

0    1
1    2
2    3
3    4
4    5
dtype: int64
a    1
b    2
c    3
d    4
e    5
dtype: int64


In [9]:
#Custom Index
import pandas as pd

# Sample data
data = [10, 20, 30, 40, 50]

# Default Series with numerical index
default_series = pd.Series(data)

# Creating a Series with a custom index
custom_index = ['a', 'b', 'c', 'd', 'e']
series_with_index = pd.Series(data, index=custom_index)

print(series_with_index)

a    10
b    20
c    30
d    40
e    50
dtype: int64


### Accessing Items of a Series

1. **Accessing by Index:** You can access elements in a Series using the index label.

   ```python
   import pandas as pd
   series = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
   item_a = series['a']
   ```

2. **Accessing by Position (Integer Indexing):** You can retrieve elements by their integer positions using `iloc`, which functions similarly to selecting items based on their index positions in a Python list.

   ```python
   # Accessing by integer position
   item_0 = series.iloc[0]
   ```

3. **Slicing:** You can use slicing to select multiple items based on their positions. Using `iloc` makes it explicit that the slice is positional.

   ```python
   # Slicing by integer positions (the stop position is excluded)
   sliced_series = series.iloc[1:3]
   ```

4. **Boolean Indexing:** You can use boolean indexing to select items based on a condition.

   ```python
   # Boolean indexing
   condition = series > 15
   filtered_series = series[condition]
   ```

5. **Fancy Indexing:** You can pass a list of labels to select several items at once. For a list of positions, use `iloc` (for example, `series.iloc[[0, 2]]`).

   ```python
   # Fancy indexing
   items = series[['a', 'c']]
   ```
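
One detail worth keeping in mind when slicing: positional slices exclude the stop position, while label slices with `loc` include the stop label. A minimal sketch, reusing the same three-item Series:

```python
import pandas as pd

series = pd.Series([10, 20, 30], index=["a", "b", "c"])

by_position = series.iloc[0:2]   # positions 0 and 1; position 2 is excluded
by_label = series.loc["a":"c"]   # labels 'a' through 'c'; 'c' is included

print(by_position.tolist())  # [10, 20]
print(by_label.tolist())     # [10, 20, 30]
```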

## DataFrame

A pandas DataFrame is a two-dimensional data structure with rows and columns. It is similar to a Google Sheets or Excel worksheet with more than one column. Here are several ways you can create a DataFrame in pandas:

1. **From a Dictionary of Lists:**
   You can create a DataFrame from a dictionary where keys are column names and values are lists.

    ```python
    import pandas as pd
    data_dict = {'Name': ['Alice', 'Bob', 'Charlie'],
                 'Age': [25, 30, 35],
                 'City': ['New York', 'San Francisco', 'Los Angeles']}
    df = pd.DataFrame(data_dict)
    ```

2. **From a List of Lists:**
   Create a DataFrame directly from a list of lists. The inner lists represent rows.

    ```python
    import pandas as pd
    data_list = [['Alice', 25, 'New York'],
                 ['Bob', 30, 'San Francisco'],
                 ['Charlie', 35, 'Los Angeles']]
    df = pd.DataFrame(data_list, columns=['Name', 'Age', 'City'])
    ```

3. **From a List of Dictionaries:**
   If your data is in the form of a list of dictionaries, each dictionary represents a row.

    ```python
    import pandas as pd
    data_list_of_dicts = [{'Name': 'Alice', 'Age': 25, 'City': 'New York'},
                          {'Name': 'Bob', 'Age': 30, 'City': 'San Francisco'},
                          {'Name': 'Charlie', 'Age': 35, 'City': 'Los Angeles'}]
    df = pd.DataFrame(data_list_of_dicts)
    ```

4. **From a NumPy Array:**
   You can create a DataFrame from a NumPy array and specify column names.

    ```python
    import pandas as pd
    import numpy as np
    data_array = np.array([[1, 2, 3],
                           [4, 5, 6],
                           [7, 8, 9]])
    df = pd.DataFrame(data_array, columns=['A', 'B', 'C'])
    ```

5. **From a CSV File:**
   Read data from a CSV file and create a DataFrame.
  [(Download Iris Dataset)](https://drive.google.com/file/d/1ttfIhaEbk-QmMhM9Ie5YLEW9kozECkbc/view?usp=sharing)

    ```python
    import pandas as pd
    df = pd.read_csv('iris.csv')
    ```
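
If you do not have the CSV file at hand, `read_csv` can also read from an in-memory text buffer, which is a convenient way to try it out. The sketch below uses a few made-up rows as a stand-in for a real file; the column names are assumptions and may not match the downloaded dataset.

```python
import io
import pandas as pd

# Stand-in for a file on disk; column names and rows are made up for illustration
csv_text = """sepal_length,sepal_width,species
5.1,3.5,setosa
6.2,2.9,versicolor
5.9,3.0,virginica
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (3, 3)
print(list(df.columns))  # ['sepal_length', 'sepal_width', 'species']
```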

### Accessing Columns of a DataFrame

1. **Using Bracket Notation:**
   ```python
   # Selecting a single column
   column_data = df['column_name']
   # Selecting multiple columns
   selected_columns = df[['column_name1', 'column_name2']]
   ```

2. **Using Dot Notation (if column names are valid Python identifiers):**
   ```python
   # Selecting a single column
   column_data = df.column_name
   # Note: This method is not suitable if column names have spaces or special characters.
   ```
3. **Selecting Columns by Data Type:**
   ```python
   # Selecting columns of a specific data type (e.g., numerical columns)
   selected_columns = df.select_dtypes(include='number')
   ```


4. **Filtering Columns by Name:**
   ```python
   # Selecting columns with names containing a substring
   selected_columns = df.filter(like='partial_column_name')
   ```
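
The four approaches above use placeholder column names. Here is a small runnable sketch on a made-up DataFrame; the `Score_Math` and `Score_Art` columns are hypothetical and exist only to give `filter` something to match.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Age": [25, 30],
    "Score_Math": [85, 92],
    "Score_Art": [78.5, 88.0],
})

single = df["Age"]                            # bracket notation returns a Series
same_column = df.Age                          # dot notation returns the same column
several = df[["Name", "Age"]]                 # a list of names returns a DataFrame
numeric = df.select_dtypes(include="number")  # Age and both score columns
scores = df.filter(like="Score")              # names containing "Score"

print(list(numeric.columns))  # ['Age', 'Score_Math', 'Score_Art']
print(list(scores.columns))   # ['Score_Math', 'Score_Art']
```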



In [10]:
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'Age': [25, 30, 22, 35, 28],
    'Score': [85, 92, 78, 95, 89],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago', 'Boston']
}

df = pd.DataFrame(data, index=['row1', 'row2', 'row3', 'row4', 'row5'])

print(df['Name'])
display(df)

row1      Alice
row2        Bob
row3    Charlie
row4      David
row5       Emma
Name: Name, dtype: object


|  | Name | Age | Score | City |
|---|---|---|---|---|
| row1 | Alice | 25 | 85 | New York |
| row2 | Bob | 30 | 92 | San Francisco |
| row3 | Charlie | 22 | 78 | Los Angeles |
| row4 | David | 35 | 95 | Chicago |
| row5 | Emma | 28 | 89 | Boston |


### Accessing Rows of a DataFrame

   - **Integer Indexing (`iloc`):** Select rows by providing zero-based integer positions.
     ```python
     # Selecting a single row by position
     row = df.iloc[2]

     # Selecting multiple rows by position
     rows = df.iloc[2:5]  # selects positions 2 through 4 (the stop position is excluded)
     ```

   - **Label Indexing (`loc`):** Select rows by providing row labels.
     ```python
     # Selecting a single row by label
     row = df.loc['row_label']

     # Selecting multiple rows by label (both endpoints are included)
     rows = df.loc['row_label_1':'row_label_3']
     ```

   - **Conditional Selection:** Select rows that satisfy a condition.
     ```python
     # Selecting rows based on a condition
     condition = df['column_name'] > 50
     selected_rows = df[condition]
     ```

   - **Selecting Rows with Specific Values of a column:**
     ```python
     # Selecting rows with specific values in a column
     selected_rows = df[df['column_name'].isin(['value1', 'value2'])]
     ```
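
The row-selection patterns above use placeholder labels and column names. The following sketch applies each of them to a small sample DataFrame so the results can be checked; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "David", "Emma"],
        "Age": [25, 30, 22, 35, 28],
        "Score": [85, 92, 78, 95, 89],
    },
    index=["row1", "row2", "row3", "row4", "row5"],
)

by_position = df.iloc[2:5]          # row3, row4, row5 (stop position excluded)
by_label = df.loc["row1":"row3"]    # row1, row2, row3 (stop label included)
high_scores = df[df["Score"] > 90]  # rows where Score exceeds 90
chosen = df[df["Name"].isin(["Alice", "Emma"])]

print(list(by_position.index))      # ['row3', 'row4', 'row5']
print(list(by_label.index))         # ['row1', 'row2', 'row3']
print(high_scores["Name"].tolist()) # ['Bob', 'David']
print(list(chosen.index))           # ['row1', 'row5']
```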

In [15]:
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'Age': [25, 30, 22, 35, 28],
    'Score': [85, 92, 78, 95, 89],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago', 'Boston']
}

df = pd.DataFrame(data, index=['row1', 'row2', 'row3', 'row4', 'row5'])

print(df.loc['row1'])

Name        Alice
Age            25
Score          85
City     New York
Name: row1, dtype: object


### Accessing Cells

1. **Using `loc` with Row and Column Labels:** Select specific rows and columns by providing labels.

    ```python
    import pandas as pd
    # Assuming 'df' is your DataFrame
    selected_data = df.loc[['row1', 'row2'], ['col1', 'col2']]
    ```

2. **Using `iloc` with Integer Positions:** Select specific rows and columns by providing integer positions.

    ```python
    import pandas as pd
    # Assuming 'df' is your DataFrame
    selected_data = df.iloc[[0, 1], [0, 1]]
    ```
   Here, replace 0, 1 with the actual integer positions of rows and columns you want to select.

3. **Selecting a Range of Rows and Columns:** You can also use slices with `loc` and `iloc` to select ranges of rows and columns.

    ```python
    import pandas as pd
    # Using loc
    selected_data_loc = df.loc['start_row':'end_row', 'start_col':'end_col']
    # Using iloc
    selected_data_iloc = df.iloc[start_row_position:end_row_position, start_col_position:end_col_position]
    ```
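
The shape of what `loc` returns depends on whether you pass single labels or lists of labels. A brief sketch, assuming a small DataFrame with labeled rows (the data is illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob"], "Age": [25, 30], "Score": [85, 92]},
    index=["row1", "row2"],
)

scalar = df.loc["row2", "Age"]               # two single labels: one value
row_part = df.loc["row2", ["Age", "Score"]]  # label plus list: a Series
frame = df.loc[["row2"], ["Age"]]            # two lists: a 1x1 DataFrame

print(scalar)                    # 30
print(type(row_part).__name__)   # Series
print(type(frame).__name__)      # DataFrame
```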

In [17]:
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'Age': [25, 30, 22, 35, 28],
    'Score': [85, 92, 78, 95, 89],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago', 'Boston']
}

df = pd.DataFrame(data, index=['row1', 'row2', 'row3', 'row4', 'row5'])

print(df.loc[['row2'], ['Age']])

      Age
row2   30


### Renaming Columns

In pandas, you can rename columns using the `rename` method or by directly assigning new column names to the `columns` attribute. Here are both methods:

#### 1. Using `rename` method:

```python
import pandas as pd
# Sample DataFrame
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

# Rename columns using the 'rename' method
df.rename(columns={'A': 'X', 'B': 'Y'}, inplace=True)
```

In the example above, the `rename` method is used with a dictionary to specify the mapping of old column names to new column names. The `inplace=True` argument modifies the original DataFrame in place.

#### 2. Directly assigning new column names:

```python
import pandas as pd
# Sample DataFrame
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

# Assign new column names directly to the 'columns' attribute
df.columns = ['X', 'Y']
```
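
The two methods differ in a couple of practical ways, sketched below. `rename` can change just a subset of columns and, without `inplace=True`, returns a new DataFrame, whereas assigning to `columns` needs exactly one name per column. The flag `assignment_failed` is only an illustrative helper.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Rename only column 'A'; the original DataFrame is left unchanged
renamed = df.rename(columns={"A": "X"})
print(list(renamed.columns))  # ['X', 'B']
print(list(df.columns))       # ['A', 'B']

# Supplying one name for two columns raises a ValueError
try:
    df.columns = ["X"]
    assignment_failed = False
except ValueError:
    assignment_failed = True
print(assignment_failed)  # True
```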

### Renaming Indices

In pandas, you can rename indices using the `rename` method. Here's an example of how you can do this:

```python
import pandas as pd
# Create a sample DataFrame
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data, index=['X', 'Y', 'Z'])

# Rename the indices using the rename method
new_indices = {'X': 'W', 'Y': 'V', 'Z': 'U'}
df_renamed = df.rename(index=new_indices)

```

In this example, the `rename` method is used with the `index` parameter to rename the indices. The `new_indices` dictionary specifies the mapping between the old index values and the new ones.
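
Besides a dictionary, `rename` also accepts a function that is applied to every label, and a dictionary does not have to cover every label. A minimal sketch on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["X", "Y", "Z"])

# A callable is applied to every index label
lowered = df.rename(index=str.lower)
print(list(lowered.index))  # ['x', 'y', 'z']

# Labels missing from the mapping are left as they are
partial = df.rename(index={"X": "W"})
print(list(partial.index))  # ['W', 'Y', 'Z']
```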

## Some Useful Functions in Pandas

- `df.mean()`: Compute the mean of each column (pass `numeric_only=True` if the DataFrame also contains non-numeric columns).
- `df.median()`: Compute the median of each column.
- `df.mode()`: Compute the mode of each column.

- `df.std()`: Compute the standard deviation of each column.
- `df.var()`: Compute the variance of each column.
- `df.min()` and `df.max()`: Compute the minimum and maximum values of each column.

- `df.quantile(q)`: Compute the qth quantile of each column.
- `df.describe()`: Generate descriptive statistics, including quartiles.


- `df + scalar`: Add a scalar to each element in the DataFrame.
- `df - scalar`: Subtract a scalar from each element.
- `df * scalar`: Multiply each element by a scalar.
- `df / scalar`: Divide each element by a scalar.


- `df.groupby('column').mean()`: Compute the mean for each group.
- `df.column.value_counts()`: Return a Series with the count of each unique value in the column.

- `df.corr()`: Compute pairwise correlation between columns.

- `df.cov()`: Compute pairwise covariance between columns.

- `df.sample(n)`: Return `n` randomly selected rows.
- `df.apply(function)`: Apply a function to each column (the default) or to each row (`axis=1`) of the DataFrame.
- `df['column'].apply(function)`: Apply a function to each element of the specified column.
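
A few of the functions listed above are sketched here on a small all-numeric DataFrame. The column names `a` and `b` and the lambda functions are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})

print(df.mean().tolist())         # [2.5, 25.0]
print(df.quantile(0.5).tolist())  # [2.5, 25.0]
print((df + 1)["a"].tolist())     # [2, 3, 4, 5]

# apply works column by column by default, or row by row with axis=1
col_range = df.apply(lambda col: col.max() - col.min())
row_sum = df.apply(lambda row: row["a"] + row["b"], axis=1)
print(col_range.tolist())  # [3, 30]
print(row_sum.tolist())    # [11, 22, 33, 44]

# Series.apply works element by element
doubled = df["a"].apply(lambda x: x * 2)
print(doubled.tolist())    # [2, 4, 6, 8]
```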

In [19]:
import pandas as pd

# Sample DataFrame
data = {
    'Product': ['A', 'A', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Region': ['North', 'North', 'South', 'South', 'North', 'South', 'North', 'South'],
    'Sales': [150, 200, 120, 180, 160, 220, 140, 190]
}
df = pd.DataFrame(data)
df

|  | Product | Region | Sales |
|---|---|---|---|
| 0 | A | North | 150 |
| 1 | A | North | 200 |
| 2 | A | South | 120 |
| 3 | B | South | 180 |
| 4 | A | North | 160 |
| 5 | B | South | 220 |
| 6 | A | North | 140 |
| 7 | B | South | 190 |


In [31]:
# Group the DataFrame by the 'Product' column, creating a GroupBy object
grouped_df = df.groupby(['Product'])
print(grouped_df)

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000169CA483520>


In [25]:
# Find the product with the highest average sales
product_with_highest_avg_sales = grouped_df['Sales'].mean().idxmax()
print(product_with_highest_avg_sales)

B


In [22]:
# Count the number of occurrences of each unique value in the 'Product' column
df.Product.value_counts()

Product
A    5
B    3
Name: count, dtype: int64

In [21]:
# Applying a function to each element in the 'Sales' column
df['Sales_With_Tax'] = df['Sales'].apply(lambda x: x * 1.1)
df

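|  | Product | Region | Sales | Sales_With_Tax |
|---|---|---|---|---|
| 0 | A | North | 150 | 165.0 |
| 1 | A | North | 200 | 220.0 |
| 2 | A | South | 120 | 132.0 |
| 3 | B | South | 180 | 198.0 |
| 4 | A | North | 160 | 176.0 |
| 5 | B | South | 220 | 242.0 |
| 6 | A | North | 140 | 154.0 |
| 7 | B | South | 190 | 209.0 |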
