
- A Pandas DataFrame is a two-dimensional, tabular data structure with labeled axes (rows and columns).
- It is a primary data structure in the Pandas library, providing a versatile and efficient way to handle and manipulate data in Python.

## Features
- Tabular structure: The DataFrame is organized as a table with rows and columns, similar to a spreadsheet or SQL table.
- Labeled axes: Both rows and columns are labeled, allowing for easy indexing and referencing of data.
- Heterogeneous data types: Each column in a DataFrame can contain different types of data, such as integers, floats, strings, or even complex objects.
- Versatility: DataFrames can be loaded from and written to a wide range of data formats, including CSV, Excel, SQL databases, and more.
- Data alignment: Operations on DataFrames align data based on row and column labels, and labels present in only one operand produce missing values (NaN) instead of errors.
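
Two of the features above, per-column data types and label-based alignment, can be seen in a minimal sketch. The names `people`, `q1`, and `q2` and their values are made up for illustration.

```python
import pandas as pd

# Each column keeps its own dtype (text, integer, float)
people = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 22],
    "height_m": [1.65, 1.80, 1.75],
})
print(people.dtypes)

# Arithmetic aligns on index labels, not on position.
# Labels found in only one operand ("a" and "d") become NaN.
q1 = pd.Series([10, 20, 30], index=["a", "b", "c"])
q2 = pd.Series([1, 2, 3], index=["b", "c", "d"])
total = q1 + q2
print(total)
```

Only the labels `b` and `c` exist in both Series, so only those two positions hold a numeric sum.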

In [1]:
# Creating a Pandas DataFrame.
# Several methods are available within Pandas to generate a DataFrame.
# Data in Python dictionaries, lists, NumPy arrays, or external files such as CSV and Excel can be transformed into a structured tabular format by Pandas.

import pandas as pd

# Creating a DataFrame from a dictionary
data_dict = {'Name': ['Alice', 'Bob', 'Charlie'],
             'Age': [25, 30, 22],
             'Salary': [50000, 60000, 45000]}

df_dict = pd.DataFrame(data_dict)
print(df_dict)

# Creating a DataFrame from lists
data_list = [['Alice', 25, 50000], ['Bob', 30, 60000], ['Charlie', 22, 45000]]

# Defining column names
columns = ['Name', 'Age', 'Salary']

df_list = pd.DataFrame(data_list, columns=columns)
print(df_list)

# Creating a DataFrame from a NumPy array
# Note: a NumPy array has a single dtype, so these mixed values are all stored as strings
import numpy as np
data_array = np.array([['Alice', 25, 50000],
                       ['Bob', 30, 60000],
                       ['Charlie', 22, 45000]])

df_array = pd.DataFrame(data_array, columns=columns)
print(df_array)

# Creating a DataFrame from a CSV file
df_csv = pd.read_csv('HousePrices.csv')
print(df_csv)


      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   22   45000
      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   22   45000
      Name Age Salary
0    Alice  25  50000
1      Bob  30  60000
2  Charlie  22  45000
                     date         price  bedrooms  bathrooms  sqft_living  \
0     2014-05-02 00:00:00  3.130000e+05       3.0       1.50         1340   
1     2014-05-02 00:00:00  2.384000e+06       5.0       2.50         3650   
2     2014-05-02 00:00:00  3.420000e+05       3.0       2.00         1930   
3     2014-05-02 00:00:00  4.200000e+05       3.0       2.25         2000   
4     2014-05-02 00:00:00  5.500000e+05       4.0       2.50         1940   
...                   ...           ...       ...        ...          ...   
4595  2014-07-09 00:00:00  3.081667e+05       3.0       1.75         1510   
4596  2014-07-09 00:00:00  5.343333e+05       3.0       2.50         1460   
4597  2014-07-09 00:00:00  
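
Because the CSV file itself is not included here, the following sketch reads CSV text from memory instead. The three rows are a small illustrative sample, and `parse_dates` is used on the assumption that the `date` column should be handled as datetimes and not as plain strings.

```python
import io
import pandas as pd

# Illustrative CSV text standing in for a file on disk
csv_text = """date,price,bedrooms
2014-05-02,313000.0,3
2014-05-02,2384000.0,5
2014-05-02,342000.0,3
"""

# read_csv accepts a file path or any file-like object
houses = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
print(houses.dtypes)
print(houses.head())
```

Passing a path such as `'HousePrices.csv'` in place of the `StringIO` object would read the same structure from disk.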

## Accessing Data

- Accessing a Pandas DataFrame means selecting and retrieving specific columns, rows, or individual cells.
- Square brackets, the loc and iloc indexers, and boolean conditions extract the necessary information from the DataFrame for further analysis and manipulation.
- Pandas allows for both label-based indexing (loc, at) and position-based indexing (iloc, iat).
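
The difference between label-based and position-based indexing is easiest to see when the row labels are not the default integers. This is a minimal sketch; the `scores` DataFrame and its labels are hypothetical.

```python
import pandas as pd

scores = pd.DataFrame(
    {"math": [90, 75, 60], "art": [70, 85, 95]},
    index=["alice", "bob", "carol"],
)

by_label = scores.loc["bob", "art"]   # row label, column label
by_position = scores.iloc[1, 1]       # row position, column position
print(by_label, by_position)

# loc slices include the end label; iloc slices exclude the end position
label_slice = scores.loc["alice":"bob"]
position_slice = scores.iloc[0:1]
print(label_slice)
print(position_slice)
```

Both lookups return the same cell, but the two slices differ in length because only `loc` includes its end point.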

In [None]:
# Creating a sample DataFrame
data = {'Column_name': [5, 15, 8],
        'Column1': [10, 20, 30],
        'Column2': [100, 200, 300],
        'Another_column': [25, 35, 45]}

df = pd.DataFrame(data)

# Accessing a single column
column_data = df['Column_name']
print("Single column:")
print(column_data)

# Accessing multiple columns
selected_columns = df[['Column1', 'Column2']]
print("\nMultiple columns:")
print(selected_columns)

# Accessing a specific row by integer position
row_data = df.iloc[0]
print("\nSpecific row:")
print(row_data)

# Accessing rows based on a condition
filtered_rows = df[df['Column_name'] > 10]
print("\nFiltered rows:")
print(filtered_rows)

# Accessing a single cell by label
value = df.at[0, 'Column_name']
print("\nSingle cell by label:")
print(value)

# Accessing a single cell by position
value = df.iat[0, 1]  # Row 0, Column 1
print("\nSingle cell by position:")
print(value)

# Accessing a single value by row and column label using .loc
selected_data = df.loc[0, 'Column_name']
print("\nData using .loc:")
print(selected_data)

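# Conditional access: select one column for the rows matching a condition
# (a single .loc call avoids chained indexing)
selected_data = df.loc[df['Column_name'] > 10, 'Another_column']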
print("\nConditional access:")
print(selected_data)
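
Conditions can also be combined. The sketch below reuses the column names from the cell above on a separate DataFrame called `sample` (a name chosen here for illustration).

```python
import pandas as pd

sample = pd.DataFrame({'Column_name': [5, 15, 8],
                       'Column1': [10, 20, 30],
                       'Another_column': [25, 35, 45]})

# Wrap each condition in parentheses and combine with & (and), | (or), ~ (not)
mask = (sample['Column_name'] > 6) & (sample['Column1'] < 30)
both = sample.loc[mask, ['Column_name', 'Another_column']]
print(both)

# isin() keeps the rows whose value appears in a list of allowed values
in_set = sample[sample['Column_name'].isin([5, 8])]
print(in_set)
```

Python's `and` and `or` keywords do not work element-wise on a Series, which is why the bitwise operators are used.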

- The head() and tail() methods preview the first and last rows of a DataFrame (five rows by default).
- These methods are useful for a quick check of column names, sample values, and obvious problems.
- The info() method prints a summary of column data types, non-null counts, and memory usage, which makes missing or inconsistent data easier to spot.
- The shape attribute returns the dimensions of the DataFrame as a (rows, columns) tuple.

In [None]:
# Create a sample DataFrame
data = {'Column_name': [5, 15, 8],
        'Column1': [10, 20, 30],
        'Column2': [100, 200, 300],
        'Another_column': [25, 35, 45]}

df = pd.DataFrame(data)

# Display the first 2 rows
print("First 2 rows:")
print(df.head(2))

# Display the last row
print("\nLast row:")
print(df.tail(1))

# Provide a comprehensive summary of the DataFrame
print("\nDataFrame summary:")
df.info()

# Return a tuple representing the dimensions of the DataFrame (Rows, columns)
print("\nDataFrame dimensions:")
print(df.shape)
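
The sample DataFrame above has no missing values, so the non-null counts from info() all equal the row count. The following sketch uses a hypothetical `readings` DataFrame with gaps to show how missing data appears.

```python
import numpy as np
import pandas as pd

readings = pd.DataFrame({
    "sensor": ["a", "b", "c", "d"],
    "value": [1.5, np.nan, 3.0, np.nan],
})

# The non-null count for "value" is lower than the number of rows
readings.info()

# isna().sum() counts the missing entries in each column
missing_per_column = readings.isna().sum()
print(missing_per_column)
print(readings.shape)
```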


## Descriptive Statistics
- Pandas supports the computation of fundamental measures such as mean and median, along with the exploration of correlations and distribution characteristics.
- The describe() method provides a quick summary of numeric columns: count, mean, standard deviation, minimum, quartiles, and maximum.

In [None]:
# Create a sample DataFrame with numeric columns
data = {'Numeric_column1': [5, 15, 8],
        'Numeric_column2': [10, 20, 30],
        'Numeric_column3': [100, 200, 300]}

df = pd.DataFrame(data)

# Display descriptive statistics for numeric columns
print("Descriptive statistics for numeric columns:")
print(df.describe())
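
By default describe() summarizes only numeric columns when a DataFrame contains mixed types. A small sketch with a hypothetical `orders` DataFrame shows how `include="all"` adds the text column.

```python
import pandas as pd

orders = pd.DataFrame({
    "item": ["pen", "pen", "book", "lamp"],
    "quantity": [3, 1, 2, 4],
})

# Default: numeric columns only
numeric_summary = orders.describe()
print(numeric_summary)

# include="all" also reports count, unique, top, and freq for text columns
full_summary = orders.describe(include="all")
print(full_summary)
```

Statistics that do not apply to a column, such as the mean of `item`, are shown as NaN in the combined summary.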

In [None]:
# Mean, Median, and Standard Deviation

# Create a sample DataFrame with numeric columns
data = {'Numeric_column1': [5, 15, 8],
        'Numeric_column2': [10, 20, 30],
        'Numeric_column3': [100, 200, 300]}

df = pd.DataFrame(data)

# Calculate mean, median, and standard deviation
mean_value = df.mean()
median_value = df.median()
std_deviation = df.std()

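# sep="" avoids the stray leading space that print() would insert after each newline
print("Mean:\n", mean_value, sep="")
print("\nMedian:\n", median_value, sep="")
print("\nStandard deviation:\n", std_deviation, sep="")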
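
The same three statistics can be collected into one table with agg(), and `axis=1` switches the calculation from columns to rows. This is a sketch on a two-column variant of the data above; `df_stats` is a name chosen for illustration.

```python
import pandas as pd

df_stats = pd.DataFrame({'Numeric_column1': [5, 15, 8],
                         'Numeric_column2': [10, 20, 30]})

# One row per statistic, one column per original column
summary = df_stats.agg(['mean', 'median', 'std'])
print(summary)

# axis=1 computes the mean across the columns of each row
row_means = df_stats.mean(axis=1)
print(row_means)
```

Note that std() in Pandas computes the sample standard deviation (`ddof=1`) by default.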

In [None]:
# The corr() method generates a correlation matrix (Pearson by default), indicating how variables relate to each other.
# Values range from -1 to 1: values closer to 1 or -1 imply a stronger linear relationship, while values near 0 suggest a weaker one.

# Create a sample DataFrame with numeric columns
data = {'Numeric_column1': [5, 15, 8],
        'Numeric_column2': [10, 20, 30],
        'Numeric_column3': [100, 200, 300]}

df = pd.DataFrame(data)

# Compute correlation matrix
correlation_matrix = df.corr()

print("Correlation matrix:\n", correlation_matrix, sep="")
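
The default Pearson coefficient measures linear association. As a sketch of how the choice of method matters, the made-up `growth` data below follows a curve (y = x squared), so the rank-based Spearman coefficient reaches 1 while Pearson stays slightly below it.

```python
import pandas as pd

growth = pd.DataFrame({"x": [1, 2, 3, 4, 5],
                       "y": [1, 4, 9, 16, 25]})

pearson = growth.corr()                     # default: method="pearson"
spearman = growth.corr(method="spearman")   # based on ranks, not raw values

print("Pearson:")
print(pearson)
print("\nSpearman:")
print(spearman)
```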

In [None]:
# The value_counts() method tallies the occurrences of each unique value in a column, sorted from most to least frequent, which helps in understanding the distribution of categorical data.

# Create a sample DataFrame with a category column
data = {'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'A', 'B', 'C']}
df = pd.DataFrame(data)

# Count occurrences of unique values in the category column
value_counts = df['Category'].value_counts()

print("Value counts:\n", value_counts, sep="")
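
When proportions are more useful than raw counts, value_counts() accepts `normalize=True`. The sketch below applies it to the same category values, stored in a Series named `categories` for illustration.

```python
import pandas as pd

categories = pd.Series(['A', 'B', 'A', 'C', 'B', 'A', 'C', 'A', 'B', 'C'])

counts = categories.value_counts()                # absolute counts
shares = categories.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```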