1. Getting Familiar with Pandas

In [1]:
import pandas as pd

In [2]:
# Creating a Series from a list
data = [1, 2, 3, 4, 5]
series = pd.Series(data)
print("Series from list:\n", series)


Series from list:
 0    1
1    2
2    3
3    4
4    5
dtype: int64


In [3]:
# Creating a DataFrame from a dictionary
data = {
    'Name': ['sasank', 'sai', 'sasi', 'varank'],
    'Age': [28, 24, 35, 32]
}
df = pd.DataFrame(data)
print("\nDataFrame from dictionary:\n", df)


DataFrame from dictionary:
      Name  Age
0  sasank   28
1     sai   24
2    sasi   35
3  varank   32


Operations on DataFrames:

Selecting data (single column and multiple columns):

In [4]:
# Selecting a single column returns a Series
ages = df['Age']
print("\nSelected column (Age):\n", ages)

# Selecting multiple columns
subset = df[['Name', 'Age']]
print("\nSubset of DataFrame:\n", subset)


Selected column (Age):
 0    28
1    24
2    35
3    32
Name: Age, dtype: int64

Subset of DataFrame:
      Name  Age
0  sasank   28
1     sai   24
2    sasi   35
3  varank   32


Filtering data:

In [5]:
filtered_df = df[df['Age'] > 40]
print("\nFiltered DataFrame (Age > 40):\n", filtered_df)


Filtered DataFrame (Age > 40):
 Empty DataFrame
Columns: [Name, Age]
Index: []
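
The filter above returns an empty DataFrame because nobody in the sample data is older than 40. As a minimal sketch, the cell below rebuilds the same sample data and filters with a lower threshold, then combines two conditions. The thresholds (30, and the 25 to 32 range) and the variable names over_30 and between are arbitrary choices for illustration, not part of the original notebook.

```python
import pandas as pd

# Rebuild the sample data from above so this cell runs on its own
df = pd.DataFrame({
    'Name': ['sasank', 'sai', 'sasi', 'varank'],
    'Age': [28, 24, 35, 32]
})

# A threshold of 30 is an arbitrary choice for illustration
over_30 = df[df['Age'] > 30]
print("Filtered DataFrame (Age > 30):\n", over_30)

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
between = df[(df['Age'] >= 25) & (df['Age'] <= 32)]
print("\nFiltered DataFrame (25 <= Age <= 32):\n", between)
```

Each comparison produces a boolean Series, and indexing the DataFrame with it keeps only the rows where the value is True.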


Modifying data:

In [6]:
df.loc[0, 'Age'] = 35  # Update a single value
print("\nDataFrame after modifying age of the first entry:\n", df)


DataFrame after modifying age of the first entry:
      Name  Age
0  sasank   35
1     sai   24
2    sasi   35
3  varank   32
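
The same df.loc indexer can also update every row that matches a condition, not just a single cell. The sketch below assumes the DataFrame as it stands after the update above; the rule (add 1 to ages under 30) and the column name IsOver30 are hypothetical and only there to show the pattern.

```python
import pandas as pd

# Sample data as it looks after the single-value update above
df = pd.DataFrame({
    'Name': ['sasank', 'sai', 'sasi', 'varank'],
    'Age': [35, 24, 35, 32]
})

# Update every row that matches a condition (adding 1 is purely illustrative)
df.loc[df['Age'] < 30, 'Age'] += 1

# Add a derived column; 'IsOver30' is a hypothetical name
df['IsOver30'] = df['Age'] > 30
print(df)
```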


2. Data Handling with Pandas

Handling missing data:

In [7]:
df_with_missing = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [None, 2, 3, 4]
})


Checking for missing data

In [8]:
print("\nDataFrame with missing values:\n", df_with_missing)
print("\nChecking for missing data:\n", df_with_missing.isnull())


DataFrame with missing values:
      A    B
0  1.0  NaN
1  2.0  2.0
2  NaN  3.0
3  4.0  4.0

Checking for missing data:
        A      B
0  False   True
1  False  False
2   True  False
3  False  False


Filling missing values with a specific value

In [16]:
df_filled = df_with_missing.fillna(0)
print("\nDataFrame after filling missing values with 0:")
print(df_filled)


DataFrame after filling missing values with 0:
     A    B
0  1.0  0.0
1  2.0  2.0
2  0.0  3.0
3  4.0  4.0
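
Filling with 0 is only one option. Two common alternatives are dropping incomplete rows with dropna() and filling each column with its own mean. The sketch below rebuilds the same df_with_missing so it runs on its own; the names df_dropped and df_mean_filled are introduced here for illustration.

```python
import pandas as pd

df_with_missing = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [None, 2, 3, 4]
})

# Drop every row that contains at least one missing value
df_dropped = df_with_missing.dropna()
print("After dropna():\n", df_dropped)

# Fill each column's gaps with that column's mean
# (mean() skips NaN by default, so it is computed from the known values)
df_mean_filled = df_with_missing.fillna(df_with_missing.mean())
print("\nAfter filling with column means:\n", df_mean_filled)
```

Which approach fits depends on the data: dropping rows loses information, while filling with a constant or a mean can distort later statistics.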


Conversion of data types:

In [10]:
df['Age'] = df['Age'].astype(float)
print("\nDataFrame with 'Age' converted to float:")
print(df)


DataFrame with 'Age' converted to float:
     Name   Age
0  sasank  35.0
1     sai  24.0
2    sasi  35.0
3  varank  32.0
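
astype(float) works here because every value in Age is already numeric. When a column arrives as text and may contain invalid entries, pd.to_numeric with errors='coerce' is a common alternative. The Series below is hypothetical data invented for this sketch.

```python
import pandas as pd

# Hypothetical column where one entry is not a valid number
raw = pd.Series(['28', '24', 'unknown', '32'])

# astype(float) would raise a ValueError on 'unknown';
# errors='coerce' turns unparseable entries into NaN instead
ages = pd.to_numeric(raw, errors='coerce')
print(ages)
```

The resulting NaN values can then be handled with the missing-data techniques shown earlier.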


3. Data Analysis with Pandas

In [12]:
print("\nSummary statistics of the DataFrame:")
print(df.describe())


Summary statistics of the DataFrame:
             Age
count   4.000000
mean   31.500000
std     5.196152
min    24.000000
25%    30.000000
50%    33.500000
75%    35.000000
max    35.000000


Grouping Data and Applying Aggregate Functions:

In [17]:
grouped_df = df.groupby('Age').agg({'Age': 'mean'})
print("\nGrouped data by Age with mean Age:")
print(grouped_df)


Grouped data by Age with mean Age:
       Age
Age       
24.0  24.0
32.0  32.0
35.0  35.0
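
Grouping by Age and averaging Age simply returns each age as its own mean, and the sample DataFrame has no City column. Grouping is more informative when the key is a separate categorical column. The sketch below adds a hypothetical City column (the city values are invented for illustration) and uses named aggregation to compute two statistics per group.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['sasank', 'sai', 'sasi', 'varank'],
    'Age': [35.0, 24.0, 35.0, 32.0],
    # Hypothetical column: these city values are invented for illustration
    'City': ['Hyderabad', 'Chennai', 'Hyderabad', 'Chennai']
})

# Named aggregation: output_column=(input_column, function)
city_stats = df.groupby('City').agg(
    mean_age=('Age', 'mean'),
    count=('Age', 'count')
)
print(city_stats)
```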


Merging two DataFrames

In [14]:
df1 = pd.DataFrame({'Key': ['A', 'B', 'C'], 'Value': [1, 2, 3]})
df2 = pd.DataFrame({'Key': ['A', 'B', 'D'], 'Value': [4, 5, 6]})
merged_df = pd.merge(df1, df2, on='Key', how='outer')
print("\nMerged DataFrame:\n", merged_df)


Merged DataFrame:
   Key  Value_x  Value_y
0   A      1.0      4.0
1   B      2.0      5.0
2   C      3.0      NaN
3   D      NaN      6.0
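
In the outer merge above, keys found in only one DataFrame get NaN on the other side, which is why the Value columns become floats, and the overlapping Value column is renamed with the default _x and _y suffixes. As a sketch of two related options, the cell below shows an inner merge with custom suffixes and an outer merge with indicator=True, which records where each row came from. The suffix names _left and _right are arbitrary choices.

```python
import pandas as pd

df1 = pd.DataFrame({'Key': ['A', 'B', 'C'], 'Value': [1, 2, 3]})
df2 = pd.DataFrame({'Key': ['A', 'B', 'D'], 'Value': [4, 5, 6]})

# Inner merge keeps only keys present in both DataFrames
inner_df = pd.merge(df1, df2, on='Key', how='inner',
                    suffixes=('_left', '_right'))
print("Inner merge:\n", inner_df)

# indicator=True adds a _merge column: both, left_only, or right_only
outer_df = pd.merge(df1, df2, on='Key', how='outer', indicator=True)
print("\nOuter merge with indicator:\n", outer_df)
```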


Concatenation

In [15]:
concatenated_df = pd.concat([df1, df2], ignore_index=True)
print("\nConcatenated DataFrame:\n", concatenated_df)


Concatenated DataFrame:
   Key  Value
0   A      1
1   B      2
2   C      3
3   A      4
4   B      5
5   D      6


4. Applications in Data Science

Advantages of Using Pandas:

Efficiency and Performance: Pandas is built on top of NumPy, so operations applied to whole columns run as vectorized array operations rather than Python-level loops.

Ease of Use: Pandas offers intuitive functions and methods for data manipulation, making complex data handling tasks simpler.

Integration with Other Libraries: Pandas integrates seamlessly with other Python libraries like Matplotlib and Seaborn for visualization, and Scikit-learn for machine learning, enhancing its utility in data science workflows.

Handling Large Datasets: Pandas can filter, merge, and aggregate large in-memory datasets with a few vectorized calls instead of row-by-row loops.

Data Cleaning: Data scientists often use Pandas to clean datasets by handling missing values, removing duplicates, and converting data types. For example, fillna() fills missing values and dropna() drops the rows or columns that contain them.

Exploratory Data Analysis (EDA): Pandas is used to summarize data, visualize trends, and explore relationships between variables. Functions like describe(), groupby(), and plot() are commonly used for EDA.

Time Series Analysis: In finance and economics, Pandas is used for time series data analysis, allowing data scientists to resample, shift, and manipulate time series data effectively.
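
The time series operations mentioned above can be sketched as follows. The dates and values are hypothetical, invented only to show resample() and shift() on a small Series with a DatetimeIndex.

```python
import pandas as pd

# Hypothetical daily values, invented for illustration
dates = pd.date_range('2024-01-01', periods=6, freq='D')
ts = pd.Series([10, 12, 11, 15, 14, 18], index=dates)

# Resample into 2-day bins and average each bin
two_day_mean = ts.resample('2D').mean()
print("2-day mean:\n", two_day_mean)

# Shift by one day to compare each value with the previous day's value
daily_change = ts - ts.shift(1)
print("\nDay-over-day change:\n", daily_change)
```

The first entry of daily_change is NaN because the first day has no previous value to compare against.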

Conclusion

By working through the examples above, you’ll gain a solid understanding of how Pandas can be used for data handling and analysis in data science. Pandas simplifies data preparation and analysis tasks, which makes it a core tool for data science professionals working with real-world data.