## Reading Data from Various Sources Using Pandas

In [2]:
import pandas as pd
from io import StringIO
json_data = '''
[
  {"employee_name": "James", "email": "james@gmail.com", "job_profile": "Team Lead"},
  {"employee_name": "Michael", "email": "michael@gmail.com", "job_profile": "Senior Developer"}
]'''
df = pd.read_json(StringIO(json_data))
df

   employee_name              email       job_profile
0          James    james@gmail.com         Team Lead
1        Michael  michael@gmail.com  Senior Developer


Converting a DataFrame Back to JSON

In [3]:
json_default = df.to_json()
print(json_default)

{"employee_name":{"0":"James","1":"Michael"},"email":{"0":"james@gmail.com","1":"michael@gmail.com"},"job_profile":{"0":"Team Lead","1":"Senior Developer"}}


In [None]:
json_records = df.to_json(orient='records')
print(json_records)

# orient='records' produces a list in which each element corresponds to one row of the DataFrame

[{"employee_name":"James","email":"james@gmail.com","job_profile":"Team Lead"},{"employee_name":"Michael","email":"michael@gmail.com","job_profile":"Senior Developer"}]
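
The difference between the default layout and orient='records' is easier to see side by side. The following is a minimal sketch, assuming a small stand-in DataFrame built inline (it reuses two of the columns above rather than the notebook's df), that prints several orient values and then reads the records form back in:

```python
from io import StringIO
import json

import pandas as pd

# Stand-in data for illustration; not the notebook's original df object.
df = pd.DataFrame(
    {
        "employee_name": ["James", "Michael"],
        "job_profile": ["Team Lead", "Senior Developer"],
    }
)

# Print the same frame under several layouts to compare their shapes.
for orient in ("columns", "records", "index", "split"):
    print(f"{orient:>8}: {df.to_json(orient=orient)}")

# 'records' is a JSON array of row objects, 'split' separates labels from data.
records = json.loads(df.to_json(orient="records"))
split = json.loads(df.to_json(orient="split"))

# Round trip: parse the records layout back into a DataFrame.
restored = pd.read_json(StringIO(df.to_json(orient="records")), orient="records")
print(restored)
```

The records layout is usually the most convenient one to hand to a web API, while split avoids repeating the column names for every row.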


Reading CSV Data from a URL

In [6]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data'
df_csv = pd.read_csv(url, header=None)
print(df_csv.head())

   0      1     2     3     4    5     6     7     8     9     10    11    12  \
0   1  14.23  1.71  2.43  15.6  127  2.80  3.06  0.28  2.29  5.64  1.04  3.92   
1   1  13.20  1.78  2.14  11.2  100  2.65  2.76  0.26  1.28  4.38  1.05  3.40   
2   1  13.16  2.36  2.67  18.6  101  2.80  3.24  0.30  2.81  5.68  1.03  3.17   
3   1  14.37  1.95  2.50  16.8  113  3.85  3.49  0.24  2.18  7.80  0.86  3.45   
4   1  13.24  2.59  2.87  21.0  118  2.80  2.69  0.39  1.82  4.32  1.04  2.93   

     13  
0  1065  
1  1050  
2  1185  
3  1480  
4   735  


In [7]:
df_csv.to_csv('wine.csv', index=False)
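
One consequence of header=None is worth checking before saving: the columns get integer labels, to_csv() writes those labels as a header row, and reading the file back yields string labels instead. The sketch below illustrates this on an inline sample, assuming only the first three fields of the first two rows shown above. The names class, alcohol and malic_acid are assumptions for illustration and should be verified against the dataset's own description.

```python
from io import StringIO

import pandas as pd

# First three fields of the first two rows shown above (shortened for brevity).
raw = "1,14.23,1.71\n1,13.20,1.78\n"

# header=None: pandas assigns integer column labels.
unnamed = pd.read_csv(StringIO(raw), header=None)
print(unnamed.columns.tolist())

# Saving without the index still writes the integer labels as a header row.
saved = unnamed.to_csv(index=False)
print(saved.splitlines()[0])

# Reading that file back turns the labels into strings.
reloaded = pd.read_csv(StringIO(saved))
print(reloaded.columns.tolist())

# Supplying names up front avoids the problem (names here are assumed).
named = pd.read_csv(
    StringIO(raw), header=None, names=["class", "alcohol", "malic_acid"]
)
print(named)
```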

In [8]:
# Install dependencies if not already installed
!pip install lxml html5lib beautifulsoup4

Collecting lxml
  Downloading lxml-6.0.2-cp314-cp314-win_amd64.whl.metadata (3.7 kB)
Collecting html5lib
  Using cached html5lib-1.1-py2.py3-none-any.whl.metadata (16 kB)
Collecting beautifulsoup4
  Downloading beautifulsoup4-4.14.3-py3-none-any.whl.metadata (3.8 kB)
Collecting webencodings (from html5lib)
  Using cached webencodings-0.5.1-py2.py3-none-any.whl.metadata (2.1 kB)
Collecting soupsieve>=1.6.1 (from beautifulsoup4)
  Using cached soupsieve-2.8-py3-none-any.whl.metadata (4.6 kB)
Collecting typing-extensions>=4.0.0 (from beautifulsoup4)
  Using cached typing_extensions-4.15.0-py3-none-any.whl.metadata (3.3 kB)
Downloading lxml-6.0.2-cp314-cp314-win_amd64.whl (4.1 MB)

In [9]:
url_html = 'https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/'
df_list = pd.read_html(url_html)
print(len(df_list))  # Number of tables found
print(df_list[0].head())  # Display first table

1
                               Bank Name          City         State   Cert  \
0           The Santa Anna National Bank    Santa Anna         Texas   5520   
1                   Pulaski Savings Bank       Chicago      Illinois  28611   
2     The First National Bank of Lindsay       Lindsay      Oklahoma   4134   
3  Republic First Bank dba Republic Bank  Philadelphia  Pennsylvania  27332   
4                          Citizens Bank      Sac City          Iowa   8758   

               Acquiring Institution      Closing Date  Fund  Sort ascending  
0          Coleman County State Bank     June 27, 2025                 10549  
1                    Millennium Bank  January 17, 2025                 10548  
2             First Bank & Trust Co.  October 18, 2024                 10547  
3  Fulton Bank, National Association    April 26, 2024                 10546  
4          Iowa Trust & Savings Bank  November 3, 2023                 10545  


In [13]:
# Install openpyxl for .xlsx files
!pip install openpyxl

Collecting openpyxl
  Downloading openpyxl-3.1.5-py2.py3-none-any.whl.metadata (2.5 kB)
Collecting et-xmlfile (from openpyxl)
  Downloading et_xmlfile-2.0.0-py3-none-any.whl.metadata (2.7 kB)
Downloading openpyxl-3.1.5-py2.py3-none-any.whl (250 kB)
Downloading et_xmlfile-2.0.0-py3-none-any.whl (18 kB)
Installing collected packages: et-xmlfile, openpyxl


Reading Excel Files
Pandas can read Excel files using pd.read_excel(). If the workbook contains multiple sheets, select one with the sheet_name argument, either by name or by zero-based position. Reading requires an engine library that matches the file format: openpyxl for .xlsx files, or xlrd for legacy .xls files.

In [15]:
df_excel = pd.read_excel('SIH_PS_2024.xlsx', sheet_name=0)
print(df_excel.head())

  Statement_id                                              Title  Category  \
0      SIH1524  Innovating for Sustainability: Driving Smart R...  Hardware   
1      SIH1525  Innovating for Sustainability: Driving Smart R...  Software   
2      SIH1526                                 Student Innovation  Hardware   
3      SIH1527                                 Student Innovation  Hardware   
4      SIH1528                                 Student Innovation  Hardware   

             Technology_Bucket Datasetfile  \
0  Smart Resource Conservation         NaN   
1  Smart Resource Conservation         NaN   
2              Smart Education         NaN   
3          Disaster Management         NaN   
4                Miscellaneous         NaN   

                                         Description         Department  \
0  Innovating for Sustainability: Driving Smart R...  Godrej Appliances   
1  Innovating for Sustainability: Driving Smart R...  Godrej Appliances   
2  Smart Education, a C
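
The sheet_name argument accepts more than a position. As a rough sketch, assuming a hypothetical two-sheet workbook written to a temporary directory (the sheet names hardware and software and the file demo.xlsx are made up for illustration), the snippet below selects a sheet by position, by name, and loads all sheets at once. It skips the round trip when openpyxl is not installed.

```python
import importlib.util
import os
import tempfile

import pandas as pd

# Hypothetical two-sheet workbook; sheet names and rows are illustrative only.
sheets = {
    "hardware": pd.DataFrame({"Statement_id": ["SIH1524"], "Category": ["Hardware"]}),
    "software": pd.DataFrame({"Statement_id": ["SIH1525"], "Category": ["Software"]}),
}

loaded = None  # stays None when openpyxl is unavailable
if importlib.util.find_spec("openpyxl") is not None:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.xlsx")

        # Write each DataFrame to its own sheet.
        with pd.ExcelWriter(path, engine="openpyxl") as writer:
            for name, frame in sheets.items():
                frame.to_excel(writer, sheet_name=name, index=False)

        by_position = pd.read_excel(path, sheet_name=0)       # first sheet
        by_name = pd.read_excel(path, sheet_name="software")  # sheet chosen by name
        loaded = pd.read_excel(path, sheet_name=None)         # dict of every sheet

    print(by_position)
    print(by_name)
    print(list(loaded))
else:
    print("openpyxl is not installed; skipping the Excel round trip")
```

Passing sheet_name=None returns a dictionary keyed by sheet name, which is handy when the set of sheets is not known in advance.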

Pickle Files
The pickle format serializes Python objects into byte streams for storage or transmission. Pandas can write a DataFrame to a pickle file with to_pickle() and load it again with read_pickle(). Because the DataFrame is restored with its dtypes and index intact, pickle is useful for saving intermediate data states or trained models in machine learning projects.

In [16]:
df_excel.to_pickle('data.pkl')
df_pickle = pd.read_pickle('data.pkl')
print(df_pickle.head())

  Statement_id                                              Title  Category  \
0      SIH1524  Innovating for Sustainability: Driving Smart R...  Hardware   
1      SIH1525  Innovating for Sustainability: Driving Smart R...  Software   
2      SIH1526                                 Student Innovation  Hardware   
3      SIH1527                                 Student Innovation  Hardware   
4      SIH1528                                 Student Innovation  Hardware   

             Technology_Bucket Datasetfile  \
0  Smart Resource Conservation         NaN   
1  Smart Resource Conservation         NaN   
2              Smart Education         NaN   
3          Disaster Management         NaN   
4                Miscellaneous         NaN   

                                         Description         Department  \
0  Innovating for Sustainability: Driving Smart R...  Godrej Appliances   
1  Innovating for Sustainability: Driving Smart R...  Godrej Appliances   
2  Smart Education, a C
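
To see what a pickle round trip keeps that a text format does not, the following sketch compares it with a CSV round trip. It assumes a small inline DataFrame rather than df_excel; the submitted column and its dates are hypothetical and exist only to give the frame a datetime dtype.

```python
import os
import tempfile
from io import StringIO

import pandas as pd

# Illustrative frame; the 'submitted' column is hypothetical.
df = pd.DataFrame(
    {
        "statement_id": ["SIH1524", "SIH1525"],
        "category": pd.Categorical(["Hardware", "Software"]),
        "submitted": pd.to_datetime(["2024-01-05", "2024-02-10"]),
    }
)

# Pickle round trip through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.pkl")
    df.to_pickle(path)
    from_pickle = pd.read_pickle(path)

# CSV round trip through an in-memory string, with default read_csv settings.
from_csv = pd.read_csv(StringIO(df.to_csv(index=False)))

print("original dtypes:\n", df.dtypes, sep="")
print("after pickle:\n", from_pickle.dtypes, sep="")
print("after CSV:\n", from_csv.dtypes, sep="")
```

With default settings, the CSV version comes back without the categorical and datetime dtypes, so they would have to be restored by hand (for example with dtype= and parse_dates=), whereas the pickled frame needs no such step.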