
<body style="font-family: Arial, sans-serif; background-color: #D9EDF7; margin: 0; padding: 0; width:100%; scroll-behavior: smooth;">

<div style="width: 100%; margin: 20px auto; padding: 20px; background-color: #D9EDF7;">
  <h1 style="font-size: 36px; color: #333; margin-bottom: 20px; ">Pandas Data Processing</h1>

  <h2 style="font-size: 24px; color: #666; margin-bottom: 10px; color:#3183BB">Basic Functionality</h2>
  <ul style="list-style-type: none; margin: 0; padding: 0;">
    <li style="margin-bottom: 10px;"><a href="#section1" style="text-decoration:none">Head and Tail</a></li>
    <li style="margin-bottom: 10px;"><a href="#section2" style="text-decoration:none">Attributes and the raw values</a></li>
    <li style="margin-bottom: 10px;"><a href="#section3" style="text-decoration:none">Descriptive statistics</a></li>
    <li style="margin-bottom: 10px;"><a href="#section4" style="text-decoration:none">Summarizing data: describe</a></li>
    <li style="margin-bottom: 10px;"><a href="#section5" style="text-decoration:none">Index of Min/Max Values</a></li>
    <li style="margin-bottom: 10px;"><a href="#section6" style="text-decoration:none">Value counts (histogramming) / Mode</a></li>
    <li style="margin-bottom: 10px;"><a href="#section7" style="text-decoration:none">Discretization and quantiling</a></li>
  </ul>

  <h2 style="font-size: 24px; color: #666; margin-bottom: 10px; color:#3183BB">Function Application</h2>
  <ul style="list-style-type: none; margin: 0; padding: 0;">
    <li style="margin-bottom: 10px;"><a href="#section8" style="text-decoration:none">Row or Column-wise Function Application</a></li>
    <li style="margin-bottom: 10px;"><a href="#section9" style="text-decoration:none">Applying elementwise Python functions</a></li>
  </ul>
    <section style= "position: absolute; bottom:0;right:0; opacity:0.3; ">By:Krishna Panthi</section>
</div>
    
    
</body>

<img src="DP.png">

### Basic Functionality

<div class="alert alert-block alert-info">
<div id="section1"><b>Head and Tail</b></div>
<p>To view a small sample of a Series or DataFrame object, use the head() and tail() methods. The default number of elements to display is five, but you may pass a custom number.</p></div>

In [None]:
#Importing numpy and Pandas libs
import numpy as np
import pandas as pd

In [None]:
long_series = pd.Series(np.random.randn(1000))
print(long_series.head())  # head() returns the first 5 elements by default

In [None]:
long_series.tail(3)  # Pass a custom number to tail() to get the last 3 elements
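
The same two methods work on a DataFrame, where they return whole rows. The following is a minimal sketch; `sample_df` and its columns are hypothetical names made up for this illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for this illustration
sample_df = pd.DataFrame({"id": range(10), "value": [x * 1.5 for x in range(10)]})

first_rows = sample_df.head()   # default: first 5 rows
last_rows = sample_df.tail(2)   # custom number: last 2 rows

print(first_rows)
print(last_rows)
```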

<div class="alert alert-block alert-info">
<div id="section2"><b>Attributes and the raw values</b></div><br>
Pandas objects have a number of attributes enabling you to access the metadata
<b><br>shape:</b> gives the axis dimensions of the object, consistent with ndarray
<b><br>Axis labels:<br></b>
- Series: index (only axis) <br>
- DataFrame: index (rows) and columns</div>

In [None]:
#Some example of the different attributes
series_data = pd.Series([10, 20, 30, 40, 50])  # Creating a Series

In [None]:
print(series_data.shape)  # A Series is one-dimensional, so shape is a 1-tuple: (5,)

In [None]:
data1 = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],    #The variable data1 is a dictionary data type in Python
        'Age': [25, 30, 35, 40, 45],
        'Gender': ['Female', 'Male', 'Male', 'Male', 'Female']}

In [None]:
my_df = pd.DataFrame(data1)
print(my_df)

In [None]:
df_shape = my_df.shape # Using the shape attribute to get the dimensions
df_shape

In [None]:
df_index = my_df.index      # Row labels: RangeIndex(start=0, stop=5, step=1)
df_columns = my_df.columns  # Column labels: Name, Age, Gender
print(df_index)
print(df_columns)

<div class="alert alert-block alert-success">
<b>Output:</b> my_df.shape returns (5, 3): the DataFrame has 5 rows and 3 columns.
</div>

In [None]:
series_data = pd.Series([10, 20, 30, 40, 50])
series_index = series_data.index   # A Series has a single axis, so index is its only set of labels
series_index

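<div class="alert alert-block alert-info">
<div id="section3"><b>Descriptive statistics</b></div><br>
Series and DataFrame provide a large number of methods for computing descriptive statistics and related operations. Most of these methods aggregate along an axis:
<b><br>Series:</b> no axis argument needed
<b><br>DataFrame:</b> “index” (axis=0, default) aggregates down the rows and returns one value per column; “columns” (axis=1) aggregates across the columns and returns one value per row</div>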

In [None]:
# Compute the mean of the Series
series_mean = series_data.mean()
series_mean

In [None]:
# Compute sums along each axis
arr1 = np.array([[2, 3, 4, 5, 3, 2], [2, 3, 2, 6, 7, 3]])
my_df2 = pd.DataFrame(arr1)  # Create a DataFrame from the array

column_sum = my_df2.sum(axis=0)  # Sum down the rows: one total per column

row_sum = my_df2.sum(axis=1)  # Sum across the columns: one total per row

total_sum = my_df2.to_numpy().sum()  # Sum of all elements

print(column_sum)
print(row_sum)
print(total_sum)
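
The axis can also be given by name rather than by number. The sketch below rebuilds the same array so that it runs on its own, and checks that `axis="index"` and `axis="columns"` behave like `axis=0` and `axis=1`.

```python
import numpy as np
import pandas as pd

my_df2 = pd.DataFrame(np.array([[2, 3, 4, 5, 3, 2], [2, 3, 2, 6, 7, 3]]))

per_column = my_df2.sum(axis="index")    # same as axis=0: one total per column
per_row = my_df2.sum(axis="columns")     # same as axis=1: one total per row

print(per_column.equals(my_df2.sum(axis=0)))
print(per_row.equals(my_df2.sum(axis=1)))
```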

In [None]:
"""
In Pandas, the skipna parameter in the sum() function determines whether 
to exclude missing values (NaN values) when computing the sum of elements
in a Series or DataFrame.
"""
mydata2 = pd.Series([1, 2, np.nan, 4, 5])
sum_with_nan = mydata2.sum(skipna=False)
sum_without_nan = mydata2.sum(skipna=True)
print("Sum with NaN (skipna=False):", sum_with_nan)  
print("Sum without NaN (skipna=True):", sum_without_nan)

<div class="alert alert-block alert-success">  
<b>Output:</b> Sum with NaN (skipna=False): nan
<br><b>Output:</b> Sum without NaN (skipna=True): 12.0
</div>

<div class="alert alert-block alert-info">
<div id="section4"><b>Summarizing data</b></div><br>
The convenient describe() method computes a variety of summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) about a Series or the columns of a DataFrame.</div>

In [None]:
my_df2.describe()
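
On a Series, `describe()` returns another Series indexed by statistic name, so individual statistics can be looked up by label. A small sketch with hypothetical data (`scores` is not part of the notebook above):

```python
import pandas as pd

scores = pd.Series([10, 20, 30, 40, 50])  # hypothetical data
summary = scores.describe()               # Series indexed by statistic name

print(summary)
print("mean:", summary["mean"], "median:", summary["50%"])
```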

<div class="alert alert-block alert-info">
<div id="section5"><b>Index of Min/Max Values</b></div><br>
The idxmin() and idxmax() methods on Series and DataFrame return the index label of the minimum and maximum value, respectively.</div>

In [None]:
mydata3 = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Score': [85, 92, 78, 95]}
mydf3 = pd.DataFrame(mydata3)
mydf3

In [None]:
# Find the index label of the row with the minimum score
min_score_index = mydf3['Score'].idxmin()
print("Index label with the minimum score:", min_score_index)

# Find the index label of the row with the maximum score
max_score_index = mydf3['Score'].idxmax()
print("Index label with the maximum score:", max_score_index)
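
Because the result is an index label, it can be passed straight to `.loc` to retrieve the matching row. A minimal sketch that recreates the same data so it runs standalone:

```python
import pandas as pd

mydf3 = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie", "David"],
                      "Score": [85, 92, 78, 95]})

top_label = mydf3["Score"].idxmax()                     # label of the highest score
top_name = mydf3.loc[top_label, "Name"]                 # look up the name at that label
low_name = mydf3.loc[mydf3["Score"].idxmin(), "Name"]   # same idea for the lowest score

print("Highest score:", top_name)
print("Lowest score:", low_name)
```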

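<div class="alert alert-block alert-info">
<div id="section6"><b>Value counts (histogramming) / Mode</b></div><br>
The value_counts() Series method computes a histogram of a 1D array of values: it counts how often each unique value occurs and, by default, sorts the result by count in descending order. The mode() method returns the most frequent value(s).</div>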


In [None]:
# value_counts() counts the occurrences of each unique value in a Series
mydata4 = np.array([2, 3, 4, 2, 1, 5, 2, 6, 9, 7, 8, 4, 7, 3, 2, 3, 98, 3, 4, 5, 2, 5])
myseries4 = pd.Series(mydata4)
myseries4.value_counts()
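
The heading also mentions the mode. As a small sketch on the same values (the name `values` is chosen here for illustration), the most frequent value can be read either from the top of `value_counts()` or from `mode()`:

```python
import pandas as pd

values = pd.Series([2, 3, 4, 2, 1, 5, 2, 6, 9, 7, 8, 4, 7, 3, 2, 3, 98, 3, 4, 5, 2, 5])

counts = values.value_counts()   # counts per unique value, most frequent first
most_common = values.mode()      # Series holding the most frequent value(s)

print(counts.head(3))
print(most_common)
```

`mode()` returns a Series rather than a scalar because several values can tie for the highest count.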

<div class="alert alert-block alert-info">
<div id="section7"><b>Discretization and quantiling</b></div><br>
Continuous values can be discretized using the cut() (bins based on values) and qcut() (bins based on sample quantiles) functions
<ul><li>The cut() function discretizes continuous data into intervals (bins) whose edges are specific values that you supply.</li>
<li>The qcut() function discretizes continuous data into intervals (bins) whose edges are sample quantiles, so each bin holds roughly the same number of observations.</li></ul> </div>

In [None]:
mydata5 = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'Frank', 'Gina'],
        'Age': [22, 34, 28, 31, 25, 39, 45]}
mydf5 = pd.DataFrame(mydata5)

In [None]:
# Discretize ages using cut() into three age groups: Young, Middle-aged, and Elderly
age_bins = [0, 30, 40, float('inf')]  # Define bins based on values
age_labels = ['Young', 'Middle-aged', 'Elderly']
mydf5['Age_Group'] = pd.cut(mydf5['Age'], bins=age_bins, labels=age_labels)
print(mydf5)

In [None]:
# Discretize ages using qcut() into three quantile-based age groups
quantile_labels = ['Group 1', 'Group 2', 'Group 3']
mydf5['Age_Quantile_Group'] = pd.qcut(mydf5['Age'], q=3, labels=quantile_labels)
print(mydf5)
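
The difference between the two functions is easiest to see on skewed data. The sketch below uses a hypothetical Series with one large outlier: `cut()` with `bins=2` splits the value range into two equal-width intervals, while `qcut()` with `q=2` splits at the median.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])  # hypothetical, deliberately skewed

by_value = pd.cut(values, bins=2)     # two equal-width bins over the value range
by_quantile = pd.qcut(values, q=2)    # two bins split at the median

# Count how many observations fall into each bin, in bin order
value_bin_sizes = by_value.cat.codes.value_counts().sort_index().tolist()
quantile_bin_sizes = by_quantile.cat.codes.value_counts().sort_index().tolist()

print("cut bin sizes:", value_bin_sizes)
print("qcut bin sizes:", quantile_bin_sizes)
```

With `cut()` almost every observation lands in the first bin, whereas `qcut()` balances the bin sizes.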

### Function Application

<div class="alert alert-block alert-info">
<div id="section8"><b>Row or Column-wise Function Application</b></div><br>
Arbitrary functions can be applied along the axes of a DataFrame using the apply() method, which, like the descriptive statistics methods, takes an optional axis argument. With axis=0 (the default) the function receives each column as a Series; with axis=1 it receives each row.</div>

In [None]:
mydata6 = {'A': [1, 2, 3],     #defining a dictionary data type variable
        'B': [4, 5, 20],
        'C': [7, 8, 32]}
mydf6 = pd.DataFrame(mydata6)

In [None]:
# Define a custom function to calculate the difference between the maximum and minimum values in a Series
def difference_max_min(series):
    return series.max() - series.min()

In [None]:
# Apply the custom function to each column (axis=0, the default): one result per column
per_column_difference = mydf6.apply(difference_max_min, axis=0)
print("Difference between max and min values in each column (axis=0):")
print(per_column_difference)

In [None]:
# Apply the custom function to each row (axis=1): one result per row
per_row_difference = mydf6.apply(difference_max_min, axis=1)
print("Difference between max and min values in each row (axis=1):")
print(per_row_difference)
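
When the function is built from operations that pandas already provides, the same result can often be obtained without `apply()`. The following sketch recreates the same data and compares `apply()` with an equivalent expression using `max()` and `min()` directly; it is an illustration, not a claim about which form to prefer in every case.

```python
import pandas as pd

mydf6 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 20], "C": [7, 8, 32]})

# apply(): the lambda receives each column as a Series
via_apply = mydf6.apply(lambda col: col.max() - col.min(), axis=0)

# Equivalent expression using the built-in aggregations
vectorized = mydf6.max(axis=0) - mydf6.min(axis=0)

# Same comparison per row
row_ranges = mydf6.max(axis=1) - mydf6.min(axis=1)

print(via_apply.equals(vectorized))
print(row_ranges)
```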

<div class="alert alert-block alert-info">
<div id="section9"><b>Applying elementwise Python functions</b></div><br>
Not every function can be vectorized (that is, written to accept a NumPy array and return another array or value). For those cases, applymap() on DataFrame and, analogously, map() on Series accept any Python function that takes a single value and returns a single value. Since pandas 2.1, applymap() is deprecated in favor of DataFrame.map().</div>

In [None]:
def square(x):
    return x ** 2

In [None]:
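# Apply square() to every element; on pandas 2.1 or newer, prefer mydf6.map(square),
# because applymap() is deprecated there
result_df = mydf6.applymap(square)
print(mydf6)      # original values
print(result_df)  # squared values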
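
The Series counterpart, `map()`, works the same way. The sketch below also shows one possible way to pick whichever elementwise DataFrame method the installed pandas version offers; `numbers` and `frame` are hypothetical names, and the `hasattr` fallback is an assumption about how you might handle both old and new versions, not an official recipe.

```python
import pandas as pd

def square(x):
    return x ** 2

numbers = pd.Series([1, 2, 3])
squared_series = numbers.map(square)   # applies square() to each element

frame = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 20]})
# DataFrame.map() is the pandas 2.1+ name for applymap();
# fall back to applymap() on older versions
elementwise = frame.map if hasattr(frame, "map") else frame.applymap
squared_frame = elementwise(square)

print(squared_series)
print(squared_frame)
```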