In [14]:
import pandas as pd
from io import StringIO

In [15]:
print("Pandas version,", pd.__version__)

Pandas version, 2.3.1


Pandas introduces two primary data structures, Series and DataFrame, along
with an Index that labels the data.
Series: a one-dimensional array of data with an index (like a labeled column of values).
DataFrame: a two-dimensional table of data, consisting of multiple Series
that share the same index (like a spreadsheet or SQL table).
The index is a set of labels for the rows (column names label the columns in the same way).
By default, Pandas assigns an integer index starting from 0 for each row.
Why do we use pandas?
In real-world data analysis, you will often work with datasets
(CSV files, databases, JSON APIs, etc.) that need cleaning, transformation, and summarization.
Pandas provides high-level data structures and functions that make these tasks easier.
It is built on NumPy, which gives it good performance on numerical data.
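
To make the relationship between Series, DataFrame, and Index concrete, here is a minimal sketch. The names `frame` and `col` are illustrative and not part of this notebook. It shows that selecting one column of a DataFrame gives back a Series carrying the same row index as the table.

```python
import pandas as pd

# A small table; with no index given, pandas assigns 0, 1, 2, ...
frame = pd.DataFrame({"Name": ["Alice", "Bob"], "Score": [89, 90]})

# Selecting a single column returns a Series
col = frame["Score"]

print(type(col).__name__)              # the column is a Series
print(col.index.equals(frame.index))   # it shares the table's row index
print(list(frame.index))               # default integer labels start at 0
```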

In [16]:
# Create a Series of exam scores
scores = pd.Series([89, 90, 88, 88], index=["Alice", "Bob", "Charlie", "Dana"])
print("Series of Scores:")
print(scores, "\n")

Series of Scores:
Alice      89
Bob        90
Charlie    88
Dana       88
dtype: int64 



In [17]:
# Create a DataFrame of students with multiple columns
data = {
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Age": [45, 33, 55, 22],
    "City": ["NY", "LA", "NY", "TX"],
    "Score": [89, 90, 88, 88]
}
df = pd.DataFrame(data)
print(df)

      Name  Age City  Score
0    Alice   45   NY     89
1      Bob   33   LA     90
2  Charlie   55   NY     88
3     Dana   22   TX     88


In [18]:
print("Columns of df:", df.columns)

Columns of df: Index(['Name', 'Age', 'City', 'Score'], dtype='object')


In [19]:
# Creating CSV data (comma-separated values)
csv_data = """Name,Age,City
Alice,45,NY
Bob,33,LA
Charlie,55,NY
Dana,22,TX
"""
# Use StringIO to simulate a file object from the string (for demo purposes)

df = pd.read_csv(StringIO(csv_data))
print("DataFrame read from CSV:")
print(df,"\n")

# Now write this DataFrame to an Excel file (this creates an actual file on disk;
# writing .xlsx requires an Excel engine such as openpyxl to be installed)
df.to_excel("people.xlsx", index=False)  # index=False omits the index from the file

print("DataFrame has been written to the file")

# In the code above, pd.read_csv parsed CSV data from a string. In practice, you would
# pass a file path instead: pd.read_csv("path/to/your/file.csv")

DataFrame read from CSV:
      Name  Age City
0    Alice   45   NY
1      Bob   33   LA
2  Charlie   55   NY
3     Dana   22   TX 

DataFrame has been written to the file


In [23]:
# JSON string example
json_data = '[{"Name": "X", "Score":5}, {"Name": "Y", "Score": 7}]'
df_json = pd.read_json(StringIO(json_data))  # wrap the string: passing literal JSON text directly is deprecated
print("DataFrame read from JSON:")
print(df_json)

# Write DataFrame to a JSON file
df_json.to_json("Sample.json", orient="records")
print("DataFrame has been written to 'Sample.json'")

DataFrame read from JSON:
  Name  Score
0    X      5
1    Y      7
DataFrame has been written to 'Sample.json'



In [27]:
product_data = {
    "Product": ["Widget", "Gadget"],
    "Price": [23,34],
    "Quantity": [33,44]
}
df = pd.DataFrame(product_data)
print(df)
# Write DataFrame to a CSV file
df.to_csv("products.csv", index=False)
print("DataFrame has been written to products.csv")

  Product  Price  Quantity
0  Widget     23        33
1  Gadget     34        44
DataFrame has been written to products.csv


In [29]:
# Reading from the csv file
df = pd.read_csv("products.csv")
print(df)

  Product  Price  Quantity
0  Widget     23        33
1  Gadget     34        44


In [30]:
# Reading from the excel file
df = pd.read_excel("people.xlsx")
print(df)

      Name  Age City
0    Alice   45   NY
1      Bob   33   LA
2  Charlie   55   NY
3     Dana   22   TX


In [31]:
# Reading from the JSON file
df = pd.read_json("Sample.json")
print(df)

  Name  Score
0    X      5
1    Y      7
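
The `orient="records"` argument used when writing the JSON file above controls how the table is laid out as JSON. The sketch below compares it with the `"columns"` layout. The name `df_demo` is illustrative, and `to_json` is called without a path so that it returns the JSON text instead of writing a file.

```python
import json
import pandas as pd

df_demo = pd.DataFrame({"Name": ["X", "Y"], "Score": [5, 7]})

# With no path argument, to_json returns the JSON text as a string
records_text = df_demo.to_json(orient="records")   # a list with one object per row
columns_text = df_demo.to_json(orient="columns")   # column -> {index label: value}

# Parse the text back with the standard library to inspect the structure
print(json.loads(records_text))
print(json.loads(columns_text))
```

The `"records"` layout drops the row index, which is usually what you want when the index is just the default 0, 1, 2, ... labels.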


Once data is loaded into a DataFrame, the first step is to inspect
its shape, structure, and basic statistics.

# DataFrame
df.head(n) - view the first n rows (default is 5)
df.tail(n) - view the last n rows (default is 5)
df.shape - get the number of rows and columns as a tuple (n_rows, n_cols)
df.columns - get the column labels
df.index - get the index (row labels)
df.dtypes - data type of each column
df.info() - concise summary: shows the index range, column names,
            non-null counts, and dtypes.
df.describe() - descriptive statistics for numeric columns
                (count, mean, std, min, quartiles, max)
If we pass include='all', it also summarizes non-numeric columns
(for example: count, number of unique values, most frequent value and its frequency).
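
As a small illustration of the `include='all'` option, the sketch below builds a table with both text and numeric columns. The name `people` and its values are illustrative, borrowed from the student data used elsewhere in this notebook.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Age": [45, 33, 55, 22],
    "City": ["NY", "LA", "NY", "TX"],
})

summary = people.describe(include="all")
# The unique/top/freq rows are filled for text columns and NaN for numeric ones;
# mean/std/min/... are filled for numeric columns and NaN for text ones.
print(summary)

# Most frequent city and how often it appears
print(summary.loc["top", "City"], summary.loc["freq", "City"])
```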

# Series
ser.value_counts() - frequency count of unique values
ser.unique() - array of unique values
ser.mean(), ser.min(), ser.max(), ser.sum(), ser.median() - common statistics
ser.isna() - Boolean Series indicating missing values (NaN)
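
The Series methods listed above can be tried on a short example. This is a sketch on made-up values. The missing entry is planted deliberately to show how each method treats NaN.

```python
import numpy as np
import pandas as pd

ser = pd.Series([88, 90, np.nan, 88], name="Score")

print(ser.value_counts())   # counts per distinct value; NaN is skipped by default
print(ser.unique())         # distinct values in order of appearance, NaN included
print(ser.mean())           # NaN is skipped by default (skipna=True)
print(ser.isna().sum())     # number of missing values
```

Note that the presence of NaN makes pandas store the values as floats, so the counts are reported for 88.0 and 90.0.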

Whenever you get a new dataset, you should always perform an initial exploration.
For example, if you're analyzing a dataset of customer purchases:
How many records are there?
What columns (features) does it have?
Are they numeric, categorical, or dates?
Are there missing values to worry about?
What are the ranges and typical values of the numeric columns (describe())?
Do any columns have suspicious values (like negative values or implausible ages)?
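
The questions above map fairly directly onto a few pandas calls. The sketch below uses a hypothetical `purchases` table invented for illustration, with a negative age and a missing age planted on purpose. It is one possible first pass, not a complete data-quality check.

```python
import numpy as np
import pandas as pd

# Hypothetical purchase records, invented for illustration
purchases = pd.DataFrame({
    "customer": ["a", "b", "c", "d"],
    "age": [34, -1, 51, np.nan],        # -1 and NaN are deliberately planted problems
    "amount": [19.99, 5.00, 250.00, 42.50],
})

print("Records:", len(purchases))             # how many records?
print("Columns:", list(purchases.columns))    # which columns?
print(purchases.dtypes)                       # numeric, text, dates?

missing = purchases.isna().sum()              # missing values per column
print(missing)

print(purchases.describe())                   # ranges of the numeric columns

suspicious = purchases[purchases["age"] < 0]  # rows with a negative age
print(suspicious)
```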

In [41]:
# Code Demo
data = {
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Age": [45, 33, 55, 22],
    "City": ["NY", "LA", "NY", "TX"],
    "Score": [89, 90, 88, 88]
}
df = pd.DataFrame(data)
print("First 3 rows:\n", df.head(3),"\n")
print("DataFrame Shape:", df.shape)
print("Columns:", df.columns)
print("Data Types:\n", df.dtypes,"\n")
df.info() # Prints info to the console
print("\nSummary Statistics\n", df.describe())
print(df["City"])
print("Unique Cities:", df["City"].unique())
print("Counts Of Each City:\n", df["City"].value_counts())

First 3 rows:
       Name  Age City  Score
0    Alice   45   NY     89
1      Bob   33   LA     90
2  Charlie   55   NY     88 

DataFrame Shape: (4, 4)
Columns: Index(['Name', 'Age', 'City', 'Score'], dtype='object')
Data Types:
 Name     object
Age       int64
City     object
Score     int64
dtype: object 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    4 non-null      object
 1   Age     4 non-null      int64 
 2   City    4 non-null      object
 3   Score   4 non-null      int64 
dtypes: int64(2), object(2)
memory usage: 260.0+ bytes

Summary Statistics
              Age      Score
count   4.000000   4.000000
mean   38.750000  88.750000
std    14.338177   0.957427
min    22.000000  88.000000
25%    30.250000  88.000000
50%    39.000000  88.500000
75%    47.500000  89.250000
max    55.000000  90.000000
0    NY
1    LA
2    NY
3    TX
Name: City, dty