

**Core Concepts of a Pandas DataFrame**

*   A DataFrame is a **two-dimensional data structure** that organizes data into rows and columns, resembling a table.
*   It's the primary way to handle tabular data in Pandas.
*   Each column in a DataFrame is a Pandas Series, and each row can also be considered a Series.

**Creating DataFrames**

*   DataFrames can be created from various sources:
    *   **2D Lists**: A list of lists can be converted into a DataFrame, with column names specified.
        ```python
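        import pandas as pd
        # Illustrative placeholder rows of [ID, Marks, Package]; the original values were not preserved.
        student_data = [[1, 80, 10], [2, 70, 7], [3, 90, 14]]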
        df = pd.DataFrame(student_data, columns=['ID', 'Marks', 'Package'])
        ```
    *   **Dictionaries**: Dictionaries can be used, where keys become column names.
        ```python
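        # Illustrative placeholder values; each key maps to a list of column values.
        student_dict = {'ID': [1, 2, 3], 'Marks': [80, 70, 90], 'Package': [10, 7, 14]}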
        df = pd.DataFrame(student_dict)
        ```
    *   **Importing from Files**:  Real-world DataFrames are often created by reading data from files like CSVs.
        ```python
        movies = pd.read_csv('movies.csv')
        ipl_matches = pd.read_csv('ipl_matches.csv')
        ```

**Essential DataFrame Attributes**

*   **.shape**: Returns a tuple representing the dimensions (rows, columns) of the DataFrame.
    ```python
    movies.shape # Example: (1629, 18)
    ipl_matches.shape # Example: (950, 20)
    ```
*   **.dtypes**: Returns a Series giving the data type of each column (each column is itself a Series with a single dtype).
    ```python
    movies.dtypes
    ipl_matches.dtypes
    ```
*   **.index**: Provides the index object for the DataFrame. By default it starts at 0, but is customizable.
    ```python
    movies.index
    ipl_matches.index
    ```
*   **.columns**: Returns an `Index` object containing the column names (wrap it in `list()` if a plain list is needed).
    ```python
    movies.columns
    ipl_matches.columns
    ```
*   **.values**: Returns the data as a 2D NumPy array.
    ```python
    df.values
    ```
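
The attributes above can be tried on a small frame. This is a minimal sketch; the `students` frame and its values are made up for illustration and do not come from the `movies` or `ipl_matches` files.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
students = pd.DataFrame(
    {"ID": [1, 2, 3], "Marks": [80.5, 67.0, 91.2], "Package": [10, 7, 14]}
)

shape = students.shape        # (rows, columns)
dtypes = students.dtypes      # Series: column name -> dtype
index = students.index        # default integer index starting at 0
columns = students.columns    # Index object, not a plain list
values = students.values      # 2D NumPy array

print(shape)
print(list(columns))
print(values.shape)
```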

**Essential DataFrame Methods**

*   **.head()**: Displays the first N rows (default is 5), which is useful for previewing data.
    ```python
    movies.head(5)
    ipl_matches.head(2)
    ```
*   **.tail()**: Displays the last N rows.
    ```python
    ipl_matches.tail(5)
    ipl_matches.tail(2)
    ```
*   **.sample()**: Returns a random sample of rows, useful for unbiased data inspection.
    ```python
    ipl_matches.sample(5)
    ```
*   **.info()**: Provides a concise summary of the DataFrame, including data types, non-null counts, and memory usage. This is highly recommended when starting with a new data set.
    ```python
    movies.info()
    ipl_matches.info()
    ```
*   **.describe()**: Generates descriptive statistics (count, mean, std, min, max, percentiles) for numerical columns.
    ```python
    movies.describe()
    ipl_matches.describe()
    ```
*   **.isnull().sum()**: Calculates the total number of missing values in each column.
    ```python
    movies.isnull().sum()
    ```
*   **.duplicated().sum()**: Checks for and returns the number of duplicated rows in the DataFrame. This is important for data cleaning.
    ```python
    movies.duplicated().sum()
    ```
*   **.rename()**: Renames columns using a dictionary.
    ```python
    students.rename(columns={'Marks': 'Percent', 'Package': 'LPA'}, inplace=True)
    ```
*   **.set_index()**: Sets a column to be the index of the DataFrame.
    ```python
    students.set_index('name', inplace=True)
    ```
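
As a rough illustration of the cleaning-oriented methods above, the sketch below builds a tiny frame with one missing value and one repeated row. The `students` data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value and one duplicated row.
students = pd.DataFrame(
    {
        "name": ["Asha", "Ben", "Ben", "Chen"],
        "Marks": [80.0, 67.0, 67.0, np.nan],
        "Package": [10, 7, 7, 14],
    }
)

missing_per_column = students.isnull().sum()   # Series of counts per column
duplicate_rows = students.duplicated().sum()   # rows identical to an earlier row

# rename and set_index return new DataFrames unless inplace=True is passed.
renamed = students.rename(columns={"Marks": "Percent", "Package": "LPA"})
indexed = renamed.set_index("name")

print(missing_per_column)
print(duplicate_rows)
print(indexed)
```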

**Mathematical Operations**

*   Standard mathematical functions like sum, min, max, and median can be applied to DataFrames.
*   The `axis` parameter controls the direction: `axis=0` (the default) computes one result per column, and `axis=1` computes one result per row.
    ```python
    students.sum() # Column wise sum
    students.sum(axis=1) # Row wise sum
    students.min() # Column wise min
    students.min(axis=1) # Row wise min
    ```
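
To make the two directions concrete, here is a small sketch with a hypothetical `scores` frame whose totals are easy to check by hand.

```python
import pandas as pd

# Hypothetical marks for three students in two subjects.
scores = pd.DataFrame(
    {"maths": [70, 85, 90], "physics": [60, 75, 95]},
    index=["s1", "s2", "s3"],
)

column_totals = scores.sum()         # axis=0: one value per column
row_totals = scores.sum(axis=1)      # axis=1: one value per row
row_minimums = scores.min(axis=1)    # lowest mark of each student

print(column_totals)
print(row_totals)
```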

**Accessing Data**

*   **Column Selection**:
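    *   Individual columns can be accessed using bracket notation or dot notation. Dot notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.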
        ```python
        movies['title_x']
        movies.title_x
        ipl_matches['Venue']
        ```
    *   Multiple columns can be accessed by passing a list of column names using bracket notation.
        ```python
        movies[['title_x', 'actors', 'year_of_release']]
        ipl_matches[['team1', 'team2', 'winner']]
        ```
*   **Row Selection**:
    *   **.iloc**: Access rows by integer index position. Slicing works similarly to Python lists.
        ```python
        movies.iloc[0] # single row (position 0 chosen for illustration)
        movies.iloc[0:5] # multiple rows
        movies.iloc[5:15]
        movies.iloc[5:16:2] # alternate rows
        movies.iloc[5:]
        movies.iloc[:5]
        movies.iloc[[0, 4, 5]] # specific rows (positions chosen for illustration)
        ```
    *   **.loc**: Access rows by index label. Slicing works with labels, not positions, and unlike `.iloc` the end label is included in the result.
        ```python
        students.loc['Nitish'] #single row
        students.loc['Nitish':'Rishabh'] #multiple rows
        students.loc['Nitish':'Rishabh':2] #alternate rows
        students.loc[['Ankita','Rupesh','Nitish']] #specific rows
        ```
    *   **Fancy Indexing**: Selecting multiple rows or columns using lists of indices or labels.
    *   **Boolean Indexing**: Filtering data based on conditions.
        ```python
        final_matches = ipl_matches[ipl_matches['match_number'] == 'Final']
        super_over_matches = ipl_matches[ipl_matches['super_over'] == 'Y']
        csk_matches_kolkata = ipl_matches[(ipl_matches['city'] == 'Kolkata') & (ipl_matches['winner'] == 'Chennai Super Kings')]
        toss_winner_matches = ipl_matches[ipl_matches['toss_winner'] == ipl_matches['winner']]
        good_movies = movies[(movies['imdb_rating'] > 8) & (movies['imdb_votes'] > 10000)]
        ```
    *   You can combine row and column selection using `.iloc` and `.loc`.
        ```python
        movies.iloc[0:3, 0:3]
        movies.loc[0:3, ['title_x', 'poster_path']]
        ```
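
The difference between position-based and label-based selection is easiest to see on a frame with a string index. The sketch below uses a hypothetical `students` frame; the names and numbers are illustrative only.

```python
import pandas as pd

# Hypothetical data indexed by student name.
students = pd.DataFrame(
    {"Marks": [80, 67, 91, 55], "Package": [10, 7, 14, 5]},
    index=["Asha", "Ben", "Chen", "Dev"],
)

by_position = students.iloc[0:2]          # positions 0 and 1; the stop is excluded
by_label = students.loc["Asha":"Chen"]    # labels Asha to Chen; the stop is included
picked = students.iloc[[0, 3]]            # fancy indexing with a list of positions
cell_block = students.loc[["Ben", "Dev"], ["Marks"]]  # rows and columns together

print(by_position)
print(by_label)
```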

**Data Manipulation**

*   **Creating New Columns**: New columns are created by assigning values directly to new column names.
    ```python
    movies['country'] = 'India'
    ```
*   New columns can also be derived from existing columns, for example with string methods or other functions.
    ```python
    movies['lead_actor'] = movies['actors'].str.split(',').str[0]  # first name in each split list
    ```
*   **Applying Functions**: Functions can be applied to series or the whole dataframe using the `.apply()` method.
    ```python
    action_movies = movies[movies['genres'].str.split(',').apply(lambda x: 'action' in x if isinstance(x, list) else False)]
    ```
*   **Data Type Conversion**: The data type of columns can be changed using `.astype()`. This can reduce memory consumption.
    ```python
    ipl_matches['id'] = ipl_matches['id'].astype('int32')
    ipl_matches['season'] = ipl_matches['season'].astype('category')
    ipl_matches['team1'] = ipl_matches['team1'].astype('category')
    ipl_matches['team2'] = ipl_matches['team2'].astype('category')
    ipl_matches['toss_winner'] = ipl_matches['toss_winner'].astype('category')
    ```
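
One way to check the memory claim is to compare `memory_usage(deep=True)` before and after the conversion. The sketch below uses a hypothetical `matches` frame with a repetitive string column; actual savings depend on the data and the pandas version.

```python
import pandas as pd

# Hypothetical data: a low-cardinality string column repeated many times.
matches = pd.DataFrame(
    {
        "id": range(1000),
        "team1": ["Team A", "Team B", "Team C", "Team D"] * 250,
    }
)

before = matches.memory_usage(deep=True).sum()   # bytes, including string contents

matches["id"] = matches["id"].astype("int32")
matches["team1"] = matches["team1"].astype("category")

after = matches.memory_usage(deep=True).sum()
print(before, after)
```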
   
**Filtering Data with Boolean Indexing**

*   Boolean indexing is a powerful technique for filtering data based on conditions.
*   You can use comparison operators (`==`, `>`, `<`, `>=`, `<=`) to create boolean masks.
*   Multiple conditions can be combined using logical operators (`&` for "and", `|` for "or").
*   When combining conditions, enclose each one in parentheses, because `&` and `|` bind more tightly than comparison operators such as `==` and `>`.
    ```python
    csk_matches_kolkata = ipl_matches[(ipl_matches['city'] == 'Kolkata') & (ipl_matches['winner'] == 'Chennai Super Kings')]
     good_movies = movies[(movies['imdb_rating'] > 8) & (movies['imdb_votes'] > 10000)]
    ```
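
It can help to name the masks before combining them. The following sketch uses a hypothetical `movies` frame (column names and values are made up) and also shows `~` for negation, which is not listed above but follows the same pattern.

```python
import pandas as pd

# Hypothetical ratings data.
movies = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D"],
        "rating": [8.5, 7.9, 9.1, 8.2],
        "votes": [20000, 50000, 500, 15000],
    }
)

high_rating = movies["rating"] > 8       # boolean Series (the mask)
many_votes = movies["votes"] > 10000

both = movies[high_rating & many_votes]      # rows where both conditions hold
either = movies[high_rating | many_votes]    # rows where at least one holds
not_high = movies[~high_rating]              # rows where the first does not hold

print(both["title"].tolist())
```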

**Additional Notes:**

*   Real-world data is usually imported from external files rather than typed in by hand.
*   Starting with `.info()` is a good habit for understanding a new data set.
*   `.isnull().sum()` is useful for checking for missing values.
*   `.duplicated()` identifies repeated rows, which matters for data cleaning.
*   `iloc` uses the implicit integer position, while `loc` uses the explicit index label.
*   Changing data types can reduce memory usage.






**Key Pandas DataFrame Methods**

*   **`value_counts()`**: This method **counts the frequency of each unique value** in a Series or DataFrame.
    *   Syntax: `df['column_name'].value_counts()` for a single column (Series) or `df.value_counts()` for the entire DataFrame.
    *   When applied to a Series, it returns a Series showing each unique value and its count.
    *   When applied to a DataFrame, it counts the occurrences of each unique row.
    *   It can be used to analyze the distribution of values in a column or the frequency of certain rows in the entire DataFrame.
    *   In real-life data analysis, it is more useful on a Series than on an entire DataFrame.
    *   For example, to find the number of times each player won the "Man of the Match" award, or to see how many times each team batted first after winning the toss.
    *   It can also be used in conjunction with other methods such as `plot()` to create visualizations like pie charts.
    *   It can be used to count the number of times each team has appeared in the `Team 1` or `Team 2` columns, to determine the total number of matches played by each team.
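
The two use cases above (award counts and matches played per team) might look like the sketch below. The `matches` frame, its column names, and its values are hypothetical stand-ins for the IPL data.

```python
import pandas as pd

# Hypothetical match results; team and player names are placeholders.
matches = pd.DataFrame(
    {
        "team1": ["X", "Y", "X", "Z"],
        "team2": ["Y", "Z", "Z", "X"],
        "player_of_match": ["p1", "p2", "p1", "p3"],
    }
)

# Frequency of each player, sorted by count in descending order.
awards = matches["player_of_match"].value_counts()

# Matches played per team: add the counts from both columns,
# treating a team that is absent from one column as zero.
played = (
    matches["team1"].value_counts()
    .add(matches["team2"].value_counts(), fill_value=0)
    .astype(int)
)

print(awards)
print(played)
```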

*   **`sort_values()`**: This method **sorts a DataFrame by the values in one or more columns**.
    *   Syntax: `df.sort_values(by='column_name', ascending=True/False)` for sorting by a single column. Use `df.sort_values(by=['col1', 'col2'], ascending=[True, False])` for sorting by multiple columns with different orders.
    *   The `ascending` parameter specifies the sorting order. The default is `True` for ascending order. Set it to `False` for descending order.
    *   When sorting by multiple columns, the DataFrame is first sorted by the first column specified, then by the second column within each group of the first column, and so on.
    *   Missing values are placed at the end by default during sorting, but this behavior can be changed using the `na_position` parameter with options `first` or `last`.
     *   The `inplace=True` parameter modifies the original DataFrame directly.
     *   For example, it can be used to sort movies by their release year and then alphabetically by title within each year.
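
A short sketch of multi-column sorting and `na_position`, using a hypothetical `movies` frame with one missing year:

```python
import numpy as np
import pandas as pd

# Hypothetical titles and release years; one year is missing.
movies = pd.DataFrame(
    {
        "title": ["Delta", "Alpha", "Charlie", "Bravo"],
        "year": [2019, 2020, 2019, np.nan],
    }
)

# Newest year first, then titles A to Z within each year.
# Rows with a missing year go last by default.
ordered = movies.sort_values(by=["year", "title"], ascending=[False, True])

# Single-column sort with missing years placed first instead.
na_first = movies.sort_values(by="year", na_position="first")

print(ordered)
```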

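*   **`rank()`**: This method **assigns ranks to values** within a Series or the columns of a DataFrame.
    *   Syntax: `series.rank(ascending=True/False)`
    *   It is most often used on a Series, but `DataFrame.rank()` also exists and ranks the values along an axis (each column by default).
    *   The ranks can be generated in ascending or descending order based on the values.
    *   By default, it ranks in ascending order, so the smallest value gets rank 1.
    *   It can be used to rank batsmen based on their total runs scored in a cricket tournament.
    *   Tied values get the same rank. With the default `method='average'`, that rank is the average of the positions the tied values occupy; other options include `'min'`, `'max'`, `'first'`, and `'dense'`.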
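
How ties are handled is easiest to see with a small example. The `runs` Series below is hypothetical; the `method` argument is a standard `rank()` parameter that the notes above do not cover in detail.

```python
import pandas as pd

# Hypothetical run totals; batsmen "a" and "c" are tied.
runs = pd.Series([500, 650, 500, 300], index=["a", "b", "c", "d"])

# Highest total gets rank 1. With the default method="average",
# the two tied 500s share the mean of ranks 2 and 3, which is 2.5.
average_rank = runs.rank(ascending=False)

# method="min" gives tied values the lowest rank in their group instead.
min_rank = runs.rank(ascending=False, method="min")

print(average_rank)
print(min_rank)
```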

*   **`sort_index()`**: This method **sorts a DataFrame or Series by its index**.
    *   Syntax: `df.sort_index(ascending=True/False)` or `series.sort_index(ascending=True/False)`.
    *   It sorts the index either in ascending or descending order.
    *   It can be applied to both Series and DataFrames.
    *   It can be used to sort a DataFrame by date or some other index.

*   **`set_index()`**: This method **sets a column as the index of a DataFrame**.
    *   Syntax: `df.set_index('column_name', inplace=True)`
    *   It transforms a column into the DataFrame's index.
     *   The `inplace=True` parameter modifies the DataFrame directly.
    *   It is useful when you want to access rows by the values in a particular column, rather than the default integer index.
     * For example, if a data set has a column containing unique names, that column can be used as an index.

*   **`reset_index()`**: This method **resets the index of a DataFrame back to the default integer index**.
    *   Syntax: `df.reset_index(inplace=True)`
    *   By default the old index is kept as a regular column; pass `drop=True` to discard it instead.
    *   The `inplace=True` parameter modifies the DataFrame directly.
    *   It can be used to undo an earlier `set_index()` call and make the index a regular column again.
    *   When applied to a Series, it returns a DataFrame in which the old index becomes a column.
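
`set_index()` and `reset_index()` are roughly inverse operations, as the following sketch with a hypothetical `students` frame suggests:

```python
import pandas as pd

# Hypothetical data with a column of unique names.
students = pd.DataFrame({"name": ["Asha", "Ben"], "Marks": [80, 67]})

indexed = students.set_index("name")        # 'name' becomes the index
asha_marks = indexed.loc["Asha", "Marks"]   # look up a row by name

restored = indexed.reset_index()            # 'name' is a regular column again

# On a Series, reset_index returns a DataFrame with the old index as a column.
marks_frame = indexed["Marks"].reset_index()

print(restored)
print(type(marks_frame).__name__)
```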

*   **`rename()`**: This method **renames columns or index labels in a DataFrame**.
    *   Syntax: `df.rename(columns={'old_name': 'new_name'}, index={'old_label': 'new_label'}, inplace=True)`
    *   It can be used to change the names of columns or the labels of the index.
    *   The `inplace=True` parameter modifies the DataFrame directly.
    *   You can pass a dictionary where keys are old names/labels and values are the new names/labels.
    *   It can be used to make column names more descriptive or change the labels of the index.

*   **`unique()`**: This method **returns unique values in a Series or column**.
    *   Syntax: `series.unique()`
    *   It returns an array of unique values.
    *   It can be used to identify the unique categories or values in a column.
    *   It includes missing values (`NaN`) in the returned array.
*   **`nunique()`**: This method **returns the number of unique values in a Series or column**.
    *   Syntax: `series.nunique()`
    *   It returns a single number representing the count of unique values.
    *   It does not count missing values by default; pass `dropna=False` to include them.
    *   It is useful to know how many unique categories are in a column.
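
The different treatment of missing values by `unique()` and `nunique()` can be checked with a short, hypothetical Series:

```python
import numpy as np
import pandas as pd

# Hypothetical city column with one missing entry.
cities = pd.Series(["Delhi", "Mumbai", "Delhi", np.nan])

distinct = cities.unique()                  # array that includes the missing entry
n_distinct = cities.nunique()               # missing values are skipped by default
n_with_nan = cities.nunique(dropna=False)   # counts the missing value as one more

print(distinct)
print(n_distinct, n_with_nan)
```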

*   **`isnull()`**: This method **checks for missing values in a Series or DataFrame**, returning a boolean mask.
    *   Syntax: `df.isnull()` or `series.isnull()`
     *  Returns `True` for the missing values and `False` for the non-missing values.
     *   It is applicable to both Series and DataFrames.
    *   You can use it to identify where missing data is located.

*   **`notnull()`**: This method **checks for non-missing values in a Series or DataFrame**, returning a boolean mask.
    *   Syntax: `df.notnull()` or `series.notnull()`
     *  Returns `True` for the non-missing values and `False` for the missing values.
    *   It is applicable to both Series and DataFrames.
     *  It is the opposite of `isnull()`.

*   **`isna().any()`**: This chain of two methods checks whether there are **any missing values in a Series or DataFrame**.
    *   Syntax: `df.isna().any()` or `series.isna().any()`
    *   On a Series, it returns a single boolean: `True` if there is at least one missing value, `False` otherwise.
    *   On a DataFrame, it returns a boolean Series with one entry per column; chain a second `.any()` for a single answer about the whole DataFrame.
    *   It is useful to quickly determine if a column has any missing values.
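
The return types of these checks differ between a Series and a DataFrame. A minimal sketch with a hypothetical two-column frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column "a" has one missing value, column "b" has none.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", "z"]})

null_mask = df.isnull()              # same shape as df, True where missing
present_mask = df.notnull()          # element-wise opposite of isnull()

per_column = df.isna().any()         # Series: one boolean per column
anywhere = df.isna().any().any()     # single boolean for the whole frame
in_column_a = df["a"].isna().any()   # single boolean for one Series

print(per_column)
```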

*   **`dropna()`**: This method **removes rows or columns with missing values** from a DataFrame or Series.
    *   Syntax: `df.dropna(axis=0 or 1, how='any' or 'all', subset=['col1', 'col2'], inplace=True)`
    *  `axis=0` removes rows, `axis=1` removes columns.
    *  `how='any'` removes rows/columns with any missing values; `how='all'` removes only those with all missing values.
    * The `subset` parameter specifies which columns to consider when dropping rows with missing values.
     * The `inplace=True` parameter modifies the DataFrame directly.
    *  It is useful when you want to eliminate rows or columns that have missing data.
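
The effect of `how`, `subset`, and `axis` can be compared side by side. The frame below is hypothetical and small enough to verify by eye.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 is complete, row 1 misses "a", row 2 is all missing.
df = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan],
        "b": [4.0, 5.0, np.nan],
        "c": [7.0, 8.0, np.nan],
    }
)

any_missing = df.dropna()               # keeps only row 0
all_missing = df.dropna(how="all")      # drops only row 2
subset_b = df.dropna(subset=["b"])      # looks at column "b" only

# axis=1 drops columns; on the first two rows only "a" has a missing value.
complete_columns = df.head(2).dropna(axis=1)

print(any_missing)
print(complete_columns)
```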

*   **`fillna()`**: This method **fills missing values in a Series or DataFrame**.
    *   Syntax: `df.fillna(value, inplace=True)` or `df.fillna(method='ffill' or 'bfill', inplace=True)`; `series.fillna(...)` works the same way.
    *   The `value` parameter specifies the value used to fill the missing values.
    *   The `method` parameter can be set to `ffill` (forward fill) or `bfill` (backward fill) to propagate values forward or backward. `value` and `method` cannot be combined in one call, and recent pandas releases deprecate `method` in favour of the dedicated `.ffill()` and `.bfill()` methods.
    *   The `inplace=True` parameter modifies the DataFrame directly.
    *   It is useful to impute missing data with specific values, the mean, median, or other values.
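
A sketch of the common fill strategies on a hypothetical `marks` Series. It uses the `.ffill()` and `.bfill()` methods rather than the `method=` argument, on the assumption that a recent pandas version is installed.

```python
import numpy as np
import pandas as pd

# Hypothetical marks with two missing entries.
marks = pd.Series([80.0, np.nan, 70.0, np.nan])

with_zero = marks.fillna(0)              # constant fill
with_mean = marks.fillna(marks.mean())   # mean of the non-missing values is 75.0
forward = marks.ffill()                  # carry the last seen value forward
backward = marks.bfill()                 # pull the next value backward

print(with_mean.tolist())
print(forward.tolist())
```

Note that the last value stays missing after `bfill()` because there is no later value to pull from.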

*   **`drop_duplicates()`**: This method **removes duplicate rows** from a DataFrame or duplicate values in a Series.
    *   Syntax: `df.drop_duplicates(subset=['col1', 'col2'], keep='first' or 'last')` or `series.drop_duplicates(keep='first' or 'last')`
    *   The `subset` parameter specifies the columns to consider when identifying duplicates.
    *   The `keep` parameter specifies which duplicates to keep. `first` (the default) keeps the first occurrence, `last` keeps the last occurrence, and `False` drops every row that has a duplicate.
    *  It can be used to eliminate duplicate data from a dataset.
    *   It is useful for cleaning datasets and ensuring that only unique rows are present.
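
The interaction of `subset` and `keep` is shown in the sketch below, using a hypothetical frame in which one row is an exact repeat:

```python
import pandas as pd

# Hypothetical data: row 2 repeats row 0 exactly; row 3 repeats only the name.
df = pd.DataFrame(
    {"name": ["Asha", "Ben", "Asha", "Asha"], "score": [1, 2, 1, 3]}
)

full_row = df.drop_duplicates()                          # compares whole rows
by_name_first = df.drop_duplicates(subset=["name"])      # first row per name
by_name_last = df.drop_duplicates(subset=["name"], keep="last")  # last row per name

print(full_row)
print(by_name_last)
```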

*   **`drop()`**: This method is used to **remove rows or columns** from a DataFrame or elements from a Series.
    *   Syntax: `df.drop(labels=['col1', 'col2'], axis=1, inplace=True)` for columns, `df.drop(labels=['row1', 'row2'], axis=0, inplace=True)` for rows, or `series.drop(labels=['label1'])` for elements of a Series. The row and element labels shown here are placeholders.
    *   The `labels` parameter specifies which columns or rows or elements to drop.
    *   The `axis` parameter specifies the axis along which to drop; 0 for rows, and 1 for columns.
    * The `inplace=True` parameter modifies the DataFrame directly.
    * It is useful for removing unwanted data from the dataframe.
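
A brief sketch of dropping columns, rows, and Series elements. The frame and its labels are hypothetical.

```python
import pandas as pd

# Hypothetical frame with string row labels.
df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]},
    index=["x", "y", "z"],
)

without_cols = df.drop(labels=["b", "c"], axis=1)   # remove two columns
without_rows = df.drop(labels=["x"], axis=0)        # remove one row by label
shorter = df["a"].drop(labels=["y"])                # remove one Series element

print(without_cols)
print(without_rows)
```

Without `inplace=True`, each call returns a new object and leaves `df` unchanged.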

*   **`apply()`**: This method **applies a function along an axis of a DataFrame or to each value of a Series**.
    *   Syntax: `df.apply(function, axis=0 or 1)` or `series.apply(function)`.
    *   When applied to a Series, it applies the function to each value.
    *   When applied to a DataFrame, `axis=0` applies the function to each column, and `axis=1` applies the function to each row.
     *  It can be used with custom functions to perform complex data transformations.
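
The three ways of calling `apply()` described above might look like this. The `points` frame and the lambda functions are illustrative choices, not part of the original notes.

```python
import pandas as pd

# Hypothetical coordinates.
points = pd.DataFrame({"x": [3, 6], "y": [4, 8]})

# Series.apply: the function receives one value at a time.
squared = points["x"].apply(lambda v: v ** 2)

# DataFrame.apply with axis=0 (default): the function receives each column.
column_range = points.apply(lambda col: col.max() - col.min())

# DataFrame.apply with axis=1: the function receives each row.
distance = points.apply(
    lambda row: (row["x"] ** 2 + row["y"] ** 2) ** 0.5, axis=1
)

print(squared.tolist())
print(distance.tolist())
```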


