## Lecture 3: [1/20]

## Working With Different File Formats: Pandas

---

## Reading Twitter JSON with Pandas

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

`import pandas as pd`
* This line imports the pandas library, a powerful and widely used library for data manipulation and analysis in Python.
* The `as pd` part creates an alias for the `pandas` library, making it easier to use later in the code (e.g., you can use `pd.read_json()` instead of `pandas.read_json()`).

In [1]:
import pandas as pd

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

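**Pandas Formula**

  `variable_name = pd.read_method(filepath_or_data, arguments)`
* `read_method` is a placeholder for a concrete reader function such as `read_json`, `read_csv`, or `read_excel`.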


<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

`df_twitter = pd.read_json('/mnt/data/public/twitter/sample/data-18062209.json.bz2', lines=True)`
* This line reads a JSON file from the specified path (`'/mnt/data/public/twitter/sample/data-18062209.json.bz2'`) and creates a pandas DataFrame called `df_twitter`. pandas infers the bz2 compression from the file extension and decompresses the file while reading.
* `pd.read_json()` is a function from the pandas library that reads data from a JSON file.
* The `lines=True` argument tells pandas that the JSON file contains multiple JSON objects, one per line. This allows pandas to efficiently read and parse the data.

**Summary**
* The code snippet demonstrates how to use the pandas library to read a JSON file containing Twitter data. The pd.read_json() function efficiently reads the data line by line and creates a pandas DataFrame, which can then be used for further analysis and exploration.
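
To see what `lines=True` does without access to the Twitter file, here is a minimal sketch on a hypothetical three-line sample (the records and the name `df_demo` are made up for illustration). Each line is parsed as one JSON object and becomes one row; keys that a line does not contain become missing values.

```python
import io

import pandas as pd

# Hypothetical JSON Lines sample: one JSON object per line.
# The third line has different keys, like the rows in the tweet data
# that only fill the `delete` column.
jsonl_text = (
    '{"id": 1, "text": "hello"}\n'
    '{"id": 2, "text": "world"}\n'
    '{"delete": {"status": {"id": 3}}}\n'
)

# Wrap the string in StringIO so pandas treats it as a file-like object.
df_demo = pd.read_json(io.StringIO(jsonl_text), lines=True)

print(df_demo.shape)         # (rows, columns)
print(df_demo.isna().sum())  # number of missing values per column
```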

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

**Future Warning Messages**
* Running the cell below emits `FutureWarning` messages. They indicate that the behavior of the `to_datetime()` function in pandas might change in a future version.
* `to_datetime()` converts values such as strings or numeric timestamps to datetime objects. `read_json()` calls it internally when it converts date-like columns such as `created_at`.
* The warning says that the way strings are parsed into datetimes may change, so the same file could be parsed differently after a pandas upgrade.
* The warning does not change the DataFrame produced here. Because the call happens inside `read_json()`, the practical options are to follow the advice printed in the warning text itself, or to convert date columns explicitly with `pd.to_datetime()` after loading.
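
If the warnings clutter the notebook output, one option is to silence them for a single call. The following is a minimal sketch on a hypothetical one-line sample (`sample` and `df_quiet` are illustrative names). Silencing only hides the message; it does not change how pandas parses the data.

```python
import io
import warnings

import pandas as pd

# Hypothetical one-line sample standing in for the Twitter file.
sample = '{"created_at": "2018-06-22 09:32:01", "id": 1}\n'

# Ignore FutureWarning only inside this block; other warnings still show,
# and the filter is restored when the block exits.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=FutureWarning)
    df_quiet = pd.read_json(io.StringIO(sample), lines=True)

print(df_quiet.shape)
```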

In [2]:
df_twitter = pd.read_json(
    "/mnt/data/public/twitter/sample/data-18062209.json.bz2",
    lines=True,
)



<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

`df_twitter`
* This line displays the contents of the `df_twitter` DataFrame.

**DataFrame Structure**: 
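* The output shows the first few rows and the last few rows of the DataFrame; `...` marks the rows and columns that are hidden.
* The DataFrame has many columns. The first four are:
    * `created_at`: Timestamp of when the tweet was created.
    * `id`: Unique numeric identifier for the tweet. It is displayed as a float (`1.010093e+18`) because some rows have no `id`, and a numeric column with missing values is stored as floating point.
    * `id_str`: String representation of the tweet ID. pandas has parsed it as a number here, so it shows the same float value as `id`.
    * `text`: The actual text content of the tweet.
* Rows such as index 3 are not tweets: they only fill the `delete` column, so every other field is missing (`NaT` or blank).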
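
A quick way to check the structure programmatically, and to keep only rows that are actual tweets, can be sketched as follows. The small `df_demo` frame is a hypothetical stand-in for `df_twitter`; the same calls apply to the real DataFrame.

```python
import pandas as pd

# Hypothetical stand-in for df_twitter with one row that is not a tweet.
df_demo = pd.DataFrame(
    {
        "created_at": pd.to_datetime(
            ["2018-06-22 09:32:01", None, "2018-06-22 09:59:27"], utc=True
        ),
        "text": ["first tweet", None, "second tweet"],
        "delete": [None, {"status": {"id": 3}}, None],
    }
)

print(df_demo.shape)   # (rows, columns)
print(df_demo.dtypes)  # data type of each column

# Keep only rows that have a creation timestamp.
tweets_only = df_demo[df_demo["created_at"].notna()]
print(len(tweets_only))
```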

In [3]:
df_twitter

Unnamed: 0,created_at,id,id_str,text,source,truncated,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,...,possibly_sensitive,delete,display_text_range,quoted_status_id,quoted_status_id_str,quoted_status,quoted_status_permalink,extended_entities,extended_tweet,withheld_in_countries
0,2018-06-22 09:32:01+00:00,1.010093e+18,1.010093e+18,得点開示してもらお、それでもしトップ3入ってたら絶対訴える,"<a href=""http://twitter.com/download/iphone"" r...",0.0,,,,,...,,,,,,,,,,
1,2018-06-22 09:32:01+00:00,1.010093e+18,1.010093e+18,RT @tatanakan: ？？？？？？？？？？？？？？？？？？？？？？？？？？？？？？？...,"<a href=""http://twitter.com/download/android"" ...",0.0,,,,,...,,,,,,,,,,
2,2018-06-22 09:32:01+00:00,1.010093e+18,1.010093e+18,ふなっしー　チャーム付ボールペン 青【ふなっしーグッズ/文房具/筆記具/ボールペン/文具/可...,"<a href=""https://twitter.com/funassyi_cafe"" re...",0.0,,,,,...,0.0,,,,,,,,,
3,NaT,,,,,,,,,,...,,"{'status': {'id': 1010092958819840005, 'id_str...",,,,,,,,
4,2018-06-22 09:32:01+00:00,1.010093e+18,1.010093e+18,RT @taejinsus: all the BTS outros deserve bett...,"<a href=""http://twitter.com/download/android"" ...",0.0,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72145,NaT,,,,,,,,,,...,,"{'status': {'id': 1010100013668581377, 'id_str...",,,,,,,,
72146,2018-06-22 09:59:27+00:00,1.010100e+18,1.010100e+18,RT @kirakira555star: １５ｇ　２００個＋α　３・４・６・８・１０ｍｍ　コ...,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",0.0,,,,,...,0.0,,,,,,,,,
72147,2018-06-22 09:59:27+00:00,1.010100e+18,1.010100e+18,@GRND_MAiRU まいるさんだから許した😡(ちょろいオタク)\n地獄少女の曲良すぎない...,"<a href=""http://twitter.com/download/iphone"" r...",0.0,1.010100e+18,1.010100e+18,7.519702e+17,7.519702e+17,...,,,"[12, 84]",,,,,,,
72148,2018-06-22 09:59:38+00:00,1.010100e+18,1.010100e+18,なんで人間は性行為の人数を自慢するんだ？失敗してきた数だろ？それか尻軽だと思われるべきだと思...,"<a href=""http://makebot.sh"" rel=""nofollow"">ナナシ...",0.0,,,,,...,,,,,,,,,,


## Converting a DataFrame to JSON

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

`print(df_twitter.head().to_json(orient='columns', indent=2))`
* This line is printing the first few rows of the `df_twitter` DataFrame in a JSON format.
* `df_twitter.head()`: This part selects the first 5 rows of the DataFrame using the head() method.

`.to_json(orient='columns', indent=2)`:
* This method converts the selected DataFrame to a JSON string.
* `orient='columns'` nests the output by column: each column name maps to an object of `{row label: value}` pairs.
* `indent=2` adds indentation to the JSON string for better readability.

**Summary:**
* The code snippet displays the first 5 rows of the `df_twitter` DataFrame in a well-formatted JSON format, making it easier to read and understand the structure of the data.
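
The nesting produced by `orient='columns'`, and the way timestamps are serialized, can be checked on a small frame. This is a sketch with a hypothetical two-row `df_demo`; the first timestamp is the one shown for the first tweet. By default `to_json()` writes datetimes as epoch milliseconds, and `date_format="iso"` switches to ISO 8601 strings.

```python
import json

import pandas as pd

# Hypothetical two-row frame for illustration.
df_demo = pd.DataFrame(
    {
        "created_at": pd.to_datetime(
            ["2018-06-22 09:32:01", "2018-06-22 09:59:27"], utc=True
        ),
        "text": ["a", "b"],
    }
)

# Parse the JSON string back into Python objects to inspect its shape.
as_columns = json.loads(df_demo.to_json(orient="columns"))
print(as_columns["text"])             # column name -> {row label: value}
print(as_columns["created_at"]["0"])  # epoch milliseconds by default

# ISO 8601 strings instead of epoch milliseconds.
as_iso = json.loads(df_demo.to_json(orient="columns", date_format="iso"))
print(as_iso["created_at"]["0"])
```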


In [4]:
print(df_twitter.head().to_json(orient="columns", indent = 2))

{
  "created_at":{
    "0":1529659921000,
    "1":1529659921000,
    "2":1529659921000,
    "3":null,
    "4":1529659921000
  },
  "id":{
    "0":1.010093039e+18,
    "1":1.010093039e+18,
    "2":1.010093039e+18,
    "3":null,
    "4":1.010093039e+18
  },
  "id_str":{
    "0":1.010093039e+18,
    "1":1.010093039e+18,
    "2":1.010093039e+18,
    "3":null,
    "4":1.010093039e+18
  },
  "text":{
    "0":"\u5f97\u70b9\u958b\u793a\u3057\u3066\u3082\u3089\u304a\u3001\u305d\u308c\u3067\u3082\u3057\u30c8\u30c3\u30d73\u5165\u3063\u3066\u305f\u3089\u7d76\u5bfe\u8a34\u3048\u308b",
    "1":"RT @tatanakan: \uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\uff1f\

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

`with open('tweets1000_cols.json', 'w') as f:`
* This line opens a file named "tweets1000_cols.json" in write mode ('w').
* The `with` statement ensures that the file is properly closed even if an error occurs within the indented block.
* The file object is assigned to the variable `f` for further use.

`f.write(df_twitter.head().to_json(orient='columns', indent=2))`
* This line writes data to the file opened in the previous step.
* `df_twitter.head()` selects the first 5 rows of the `df_twitter` DataFrame.
* `.to_json(orient='columns', indent=2)` converts the selected DataFrame to a JSON string.
    * `orient='columns'` specifies that the output should be organized by columns.
    * `indent=2` adds indentation to the JSON string for better readability.
* `f.write(...)` writes the generated JSON string to the file.

In [8]:
with open("tweets1000_cols.json", "w") as f:
    f.write(df_twitter.head().to_json(orient="columns", indent=2))

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

`df_twitter.head().to_json('tweets1000_cols.json', orient='columns', indent=2)`
* This line directly writes the JSON representation of the first 5 rows of the `df_twitter` DataFrame to the file "tweets1000_cols.json".
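
Both ways of saving (building the string and calling `f.write()`, or passing a path to `to_json()`) should produce the same JSON content. A minimal sketch that checks this in a temporary directory, with hypothetical file names and a made-up `df_demo`:

```python
import json
import os
import tempfile

import pandas as pd

df_demo = pd.DataFrame({"id": [1, 2], "text": ["a", "b"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path_write = os.path.join(tmp_dir, "via_write.json")
    path_direct = os.path.join(tmp_dir, "via_to_json.json")

    # Option 1: build the JSON string, then write it yourself.
    with open(path_write, "w") as f:
        f.write(df_demo.to_json(orient="columns", indent=2))

    # Option 2: let pandas open and write the file.
    df_demo.to_json(path_direct, orient="columns", indent=2)

    # Compare the parsed content of both files.
    with open(path_write) as fa, open(path_direct) as fb:
        same_content = json.load(fa) == json.load(fb)

print(same_content)
```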


In [11]:
df_twitter.head().to_json("tweets1000_cols.json", orient="columns", indent= 2)

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

`import json`
* This line imports the `json` library, which is used for working with JSON data in Python. The cells shown here do not call it directly, because `to_json()` already returns a JSON string.

`print(df_twitter.head().to_json(orient='index', indent=2))`
* This line prints the first 5 rows of the `df_twitter` DataFrame in a JSON format to the console.
* `orient='index'` nests the output by row: each row label maps to an object of `{column name: value}` pairs.
* `indent=2` adds indentation to the JSON string for better readability.

**Summary**
* Write the first 5 rows of a pandas DataFrame (df_twitter) to a JSON file.
* Print the same data in JSON format to the console.
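
One place where the `json` module becomes useful is turning the string returned by `to_json()` back into Python objects. A minimal sketch with a hypothetical two-row `df_demo`, showing the row-first nesting of `orient='index'`:

```python
import json

import pandas as pd

df_demo = pd.DataFrame({"id": [1, 2], "text": ["a", "b"]})

# Row label -> {column name: value}
by_row = json.loads(df_demo.to_json(orient="index"))
print(by_row["0"])

# json.dumps can pretty-print the parsed structure again.
print(json.dumps(by_row, indent=2))
```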

In [12]:
import json

In [13]:
print(df_twitter.head().to_json(orient="index", indent= 2))

{
  "0":{
    "created_at":1529659921000,
    "id":1.010093039e+18,
    "id_str":1.010093039e+18,
    "text":"\u5f97\u70b9\u958b\u793a\u3057\u3066\u3082\u3089\u304a\u3001\u305d\u308c\u3067\u3082\u3057\u30c8\u30c3\u30d73\u5165\u3063\u3066\u305f\u3089\u7d76\u5bfe\u8a34\u3048\u308b",
    "source":"<a href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\">Twitter for iPhone<\/a>",
    "truncated":0.0,
    "in_reply_to_status_id":null,
    "in_reply_to_status_id_str":null,
    "in_reply_to_user_id":null,
    "in_reply_to_user_id_str":null,
    "in_reply_to_screen_name":null,
    "user":{
      "id":828743842237001728,
      "id_str":"828743842237001728",
      "name":"\u3058\u3046\ud83d\udc36",
      "screen_name":"jj_iiu",
      "location":"Precious \u4e16\u754c \u753a",
      "url":null,
      "description":"@nanjolno * @saito_nagisa * @yanaginagi * @maaya_taso* NEXT\u21d26\/17 \u30a4\u30b3\u30e9\u30d6\u30c0\u30f3\u30b9\u30ec\u30c3\u30b9\u30f3\u30016\/24 NBC\u5f8c\u591c\u796d\u3

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

`print(df_twitter.head().to_json(orient='records', indent=2))`
* `df_twitter.head()`: This part selects the first 5 rows of the `df_twitter` DataFrame.
* `.to_json(orient='records', indent=2)`:
    * This method converts the selected DataFrame to a JSON string.
    * `orient='records'` specifies that the output should be organized as a list of dictionaries, where each dictionary represents a row (record) in the DataFrame.
    * `indent=2` adds indentation to the JSON string for better readability.
* `print(...)`: This displays the resulting JSON string to the console.

**Summary**
* The code snippet is printing the first 5 rows of the df_twitter DataFrame in a JSON format where each row is represented as a separate dictionary object. This is useful for visualizing the data in a structured and human-readable format, and it can also be used to easily save the data to a JSON file for later use.
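
The record layout is also the one that round-trips most directly to the one-object-per-line format read earlier with `lines=True`. A minimal sketch, using a hypothetical `df_demo`:

```python
import io
import json

import pandas as pd

df_demo = pd.DataFrame({"id": [1, 2], "text": ["a", "b"]})

# A list with one dictionary per row.
records = json.loads(df_demo.to_json(orient="records"))
print(records)

# With lines=True each record is written on its own line (JSON Lines).
jsonl_text = df_demo.to_json(orient="records", lines=True)

# Reading the JSON Lines text back gives the same rows.
df_back = pd.read_json(io.StringIO(jsonl_text), lines=True)
print(df_back.to_dict("records") == records)
```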

In [14]:
print(df_twitter.head().to_json(orient="records", indent=2))

[
  {
    "created_at":1529659921000,
    "id":1.010093039e+18,
    "id_str":1.010093039e+18,
    "text":"\u5f97\u70b9\u958b\u793a\u3057\u3066\u3082\u3089\u304a\u3001\u305d\u308c\u3067\u3082\u3057\u30c8\u30c3\u30d73\u5165\u3063\u3066\u305f\u3089\u7d76\u5bfe\u8a34\u3048\u308b",
    "source":"<a href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\">Twitter for iPhone<\/a>",
    "truncated":0.0,
    "in_reply_to_status_id":null,
    "in_reply_to_status_id_str":null,
    "in_reply_to_user_id":null,
    "in_reply_to_user_id_str":null,
    "in_reply_to_screen_name":null,
    "user":{
      "id":828743842237001728,
      "id_str":"828743842237001728",
      "name":"\u3058\u3046\ud83d\udc36",
      "screen_name":"jj_iiu",
      "location":"Precious \u4e16\u754c \u753a",
      "url":null,
      "description":"@nanjolno * @saito_nagisa * @yanaginagi * @maaya_taso* NEXT\u21d26\/17 \u30a4\u30b3\u30e9\u30d6\u30c0\u30f3\u30b9\u30ec\u30c3\u30b9\u30f3\u30016\/24 NBC\u5f8c\u591c\u796d\u30017

## Reading TSV Files into a Pandas DataFrame

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

`df_temp = pd.read_csv(...)`
* This line reads data from a tab-separated values (TSV) file into a pandas DataFrame.
* `pd.read_csv()` is the pandas function for reading delimited text files. It assumes commas by default, so the delimiter must be given explicitly for a TSV file.
* `'/mnt/data/public/amazon-reviews/amazon_reviews_us_Apparel_v1_00.tsv.gz'` is the path to the TSV file. The `.gz` extension lets pandas infer gzip compression and decompress the file while reading.
* `delimiter='\t'` specifies that the fields in the file are separated by tab characters.
* `on_bad_lines='skip'` tells pandas to skip lines that cannot be parsed properly, such as lines with more fields than expected, instead of raising an error. This is useful for handling inconsistencies in the data.
* `nrows=1000` tells pandas to read only the first 1000 rows from the file.

**Summary**
* The code snippet reads the first 1000 rows of a tab-separated, gzip-compressed file containing Amazon product reviews into a pandas DataFrame, skipping any malformed lines.
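
The effect of `delimiter='\t'` and `on_bad_lines='skip'` can be sketched on a small in-memory sample. The data below is hypothetical: three review rows, one of which has an extra field and is therefore skipped.

```python
import io

import pandas as pd

# Hypothetical TSV sample; the second data line has one field too many.
tsv_text = (
    "marketplace\treview_id\tstar_rating\n"
    "US\tR1\t5\n"
    "US\tR2\t4\textra\n"
    "US\tR3\t3\n"
)

df_demo = pd.read_csv(
    io.StringIO(tsv_text),
    delimiter="\t",        # fields are separated by tabs
    on_bad_lines="skip",   # drop lines with too many fields
    nrows=1000,            # read at most 1000 data rows
)

print(df_demo.shape)
print(df_demo["review_id"].tolist())
```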

In [17]:
df_temp = pd.read_csv(
    "/mnt/data/public/amazon-reviews/amazon_reviews_us_Apparel_v1_00.tsv.gz",
    delimiter='\t', on_bad_lines="skip", nrows = 1000
)
    

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

`df_temp.shape`
* This line returns the shape of the `df_temp` DataFrame as a `(rows, columns)` tuple.
* The output `(1000, 15)` indicates that the DataFrame has 1000 rows and 15 columns.

`df_temp`
* This line displays the first few and last few rows of the `df_temp` DataFrame.
* The output shows the column names (e.g., `marketplace`, `customer_id`, `review_id`, `product_id`, `product_title`) and the corresponding values for those rows.

**Summary**
* The code snippet reads the first 1000 rows of a TSV file containing Amazon product reviews into a pandas DataFrame. The DataFrame is then inspected by checking its shape and displaying the first few rows to get an initial understanding of the data.

In [19]:
df_temp.shape

(1000, 15)

In [20]:
df_temp

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,US,32158956,R1KKOXHNI8MSXU,B01KL6O72Y,24485154,Easy Tool Stainless Steel Fruit Pineapple Core...,Apparel,4,0,0,N,Y,★ THESE REALLY DO WORK GREAT WITH SOME TWEAKING ★,"These Really Do Work Great, But You Do Need To...",2013-01-14
1,US,2714559,R26SP2OPDK4HT7,B01ID3ZS5W,363128556,V28 Women Cowl Neck Knit Stretchable Elasticit...,Apparel,5,1,2,N,Y,Favorite for winter. Very warm!,I love this dress. Absolute favorite for winte...,2014-03-04
2,US,12608825,RWQEDYAX373I1,B01I497BGY,811958549,James Fiallo Men's 12-Pairs Low Cut Athletic S...,Apparel,5,0,0,N,Y,Great Socks for the money.,"Nice socks, great colors, just enough support ...",2015-07-12
3,US,25482800,R231YI7R4GPF6J,B01HDXFZK6,692205728,Belfry Gangster 100% Wool Stain-Resistant Crus...,Apparel,5,0,0,N,Y,Slick hat!,"I bought this for my husband and WOW, this is ...",2015-06-03
4,US,9310286,R3KO3W45DD0L1K,B01G6MBEBY,431150422,JAEDEN Women's Beaded Spaghetti Straps Sexy Lo...,Apparel,5,0,0,N,Y,I would do it again!,Perfect dress and the customer service was awe...,2015-06-12
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,US,15229609,RZQ44CTX4DN4C,B013CSAX9O,203771527,Clover Men's Aluminum Wallet,Apparel,5,0,0,N,N,Less is More!,This is one NEAT minimal wallet. I am still ad...,2015-08-18
996,US,15276666,R1SBJWY8KQY7UX,B013CSAW68,203771527,Clover Men's Aluminum Wallet,Apparel,5,0,0,N,N,Best Wallet Yet!,Allows me to hold a lot more cash and cards th...,2015-08-06
997,US,53047034,R3V2J6BZ4NZ979,B013CSAW68,203771527,Clover Men's Aluminum Wallet,Apparel,5,0,0,N,N,5++ stars minus one,"If you don't have a thick car key, this wallet...",2015-08-24
998,US,2153196,R1F0DQ62SJRHVR,B013CSAW68,203771527,Clover Men's Aluminum Wallet,Apparel,5,1,1,N,N,Life Hack,Best wallet ever! It's amazing that I can walk...,2015-08-14


## Data Import, Indexing, and Cleaning

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

`df_ncr = pd.read_excel('/mnt/data/public/census/2020/NCR.xlsx').drop(columns=['Unnamed: 0'])`
* This line reads data from the Excel file located at `/mnt/data/public/census/2020/NCR.xlsx` into a pandas DataFrame called `df_ncr`.
* `.drop(columns=['Unnamed: 0'])`: This removes the column named 'Unnamed: 0' from the DataFrame. pandas assigns `Unnamed: N` names to columns whose header cell is empty; this one is not needed for the analysis.

`df_ncr = df_ncr.set_index('Total Population by Province, City, and Municipality:')`
* This line sets the 'Total Population by Province, City, and Municipality:' column as the index of the DataFrame.
* The index labels each row. Setting this column as the index makes it easier to access and manipulate the data by location name.

`df_ncr = df_ncr.dropna()`
* This line removes every row that contains a missing value (NaN), which discards the blank spacer rows of the sheet.

`df_ncr = df_ncr.iloc[2:]`
* This line selects all rows from the DataFrame starting from the third row (index 2) onwards.
* `iloc` is used for integer-location based indexing, allowing you to select rows and columns by their integer position.

`df_ncr = df_ncr.rename(columns={'Unnamed: 2': 'Population'})`
* This line renames the column 'Unnamed: 2' to 'Population'.
* This makes the column name more meaningful and easier to work with.

`df_ncr.index.name = 'Province, City, and Municipality'`
* This line sets the name of the index to 'Province, City, and Municipality'.
* This makes the DataFrame more readable and informative.

`df_ncr`
* This line displays the final DataFrame after all the transformations have been applied.
* The output shows the DataFrame with the 'Province, City, and Municipality' as the index and the corresponding 'Population' values.

**Summary**
* The code snippet reads data from an Excel file, cleans it by removing an unnecessary column, dropping rows with missing values, and selecting the relevant rows, renames a column for better readability, and sets the index to the location column. The resulting DataFrame contains the population data for different cities and municipalities in the National Capital Region (NCR) of the Philippines.
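
The cleaning steps can also be written as one method chain. The sketch below runs on a hand-built stand-in for the raw sheet, so it does not need the Excel file; the row layout and header text in `raw` are assumptions, and only the two population figures are taken from the output in this section.

```python
import numpy as np
import pandas as pd

label_col = "Total Population by Province, City, and Municipality:"

# Hand-built stand-in for what read_excel returns (layout is an assumption).
raw = pd.DataFrame(
    {
        "Unnamed: 0": [np.nan] * 7,
        label_col: [
            "as of May 1, 2020",
            np.nan,
            "Province, City, and Municipality",
            "(header, continued)",
            "NATIONAL CAPITAL REGION",
            np.nan,
            "CITY OF MANILA",
        ],
        "Unnamed: 2": [np.nan, np.nan, "Total", "Population", 13484462, np.nan, 1846513],
    }
)

df_demo = (
    raw.drop(columns=["Unnamed: 0"])  # remove the empty column
    .set_index(label_col)             # use the location names as row labels
    .dropna()                         # drop rows with any missing value
    .iloc[2:]                         # skip the two header rows that survive dropna
    .rename(columns={"Unnamed: 2": "Population"})
)
df_demo.index.name = "Province, City, and Municipality"

print(df_demo)
```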

In [23]:
df_ncr = pd.read_excel('/mnt/data/public/census/2020/NCR.xlsx').drop(columns=['Unnamed: 0'])
df_ncr = df_ncr.set_index('Total Population by Province, City, and Municipality:')
df_ncr = df_ncr.dropna()
df_ncr = df_ncr.iloc[2:]
df_ncr = df_ncr.rename(
    columns={
        'Unnamed: 2': 'Population'
    }
)
df_ncr.index.name = 'Province, City, and Municipality'
df_ncr

"Province, City, and Municipality",Population
NATIONAL CAPITAL REGION,13484462
CITY OF MANILA,1846513
CITY OF MANDALUYONG,425758
CITY OF MARIKINA,456059
CITY OF PASIG,803159
QUEZON CITY,2960048
CITY OF SAN JUAN,126347
CITY OF CALOOCAN,1661584
CITY OF MALABON,380522
CITY OF NAVOTAS,247543


<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

`df_ncr = pd.read_excel('/mnt/data/public/census/2020/NCR.xlsx').drop(columns=['Unnamed: 0'])`
* `pd.read_excel('NCR.xlsx')`: This reads the data from the Excel file named "NCR.xlsx" and creates a pandas DataFrame (a table-like data structure).   
* `.drop(columns=['Unnamed: 0'])`: This removes a column named "Unnamed: 0" from the DataFrame. This is often done to clean up extra index columns or unnecessary data that might be present in the Excel file.   

`df_ncr = df_ncr.set_index('Total Population by Province, City, and Municipality:')`
* This line sets the column labeled "Total Population by Province, City, and Municipality:" as the index of the DataFrame. The index is used to label the rows, making it easier to access data by location name rather than just row number.   

`df_ncr = df_ncr.dropna()`
* This line removes any rows that have missing values (NaN) in any of the columns. This is a data cleaning step to ensure that only complete data is used.   

`df_ncr = df_ncr.iloc[2:]`
* This line selects all rows starting from the third row (index 2) to the end of the DataFrame. `iloc` is used for integer-based indexing, so it selects rows based on their numerical position. This step likely removes header rows or introductory text from the Excel file that are not part of the data.

`df_ncr.columns = ['Population']`
* This line sets the column names of the DataFrame to just "Population". This assumes that after dropping the unnecessary columns, the only column remaining contains population data.

`df_ncr.index.name = 'Province, City, and Municipality'`
* This line sets the name of the index to "Province, City, and Municipality". This provides a descriptive label for the index, which improves readability.

`df_ncr`
* This line displays the final DataFrame `df_ncr` after all the transformations. The output you see is the DataFrame with "Province, City, and Municipality" as the index and "Population" as the column, showing population figures for different locations in the NCR.

**Summary**
* The code reads population data from an Excel file, cleans and prepares it by removing unnecessary data, sets the location as the index, renames the population column, and then displays the resulting DataFrame. The final DataFrame shows the population of different cities and municipalities in the NCR.

In [24]:
df_ncr = pd.read_excel('/mnt/data/public/census/2020/NCR.xlsx').drop(columns=['Unnamed: 0'])
df_ncr = df_ncr.set_index('Total Population by Province, City, and Municipality:')
df_ncr = df_ncr.dropna()
df_ncr = df_ncr.iloc[2:]
df_ncr.columns = ['Population']
df_ncr.index.name = 'Province, City, and Municipality'
df_ncr

"Province, City, and Municipality",Population
NATIONAL CAPITAL REGION,13484462
CITY OF MANILA,1846513
CITY OF MANDALUYONG,425758
CITY OF MARIKINA,456059
CITY OF PASIG,803159
QUEZON CITY,2960048
CITY OF SAN JUAN,126347
CITY OF CALOOCAN,1661584
CITY OF MALABON,380522
CITY OF NAVOTAS,247543


<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

`df_ncr = pd.read_excel('/mnt/data/public/census/2020/NCR.xlsx', header=None)`

`pd.read_excel('NCR.xlsx', header=None)`
* This reads the data from the Excel file "NCR.xlsx" and creates a pandas DataFrame. The `header=None` argument is crucial here. It tells pandas that the Excel file does not have a header row (the first row containing column names). This means pandas will assign default numerical column indices (0, 1, 2, etc.).

`df_ncr = df_ncr.loc[6:25, 1:]`
* This selects a specific portion of the DataFrame.
    * `6:25`: This selects the rows labeled 6 through 25. `.loc` is label-based and includes both endpoints. Because the DataFrame still has its default integer index, the labels coincide with the row positions. This range likely corresponds to the rows in your Excel file that contain the city/municipality data you're interested in.
    * `1:`: This selects all columns from the one labeled 1 (the second column) to the end, which leaves out the first column of the sheet.

`df_ncr = df_ncr.dropna(how='all')`
* `df_ncr.dropna(how='all')`: This removes any rows where all values are missing (NaN). Empty strings do not count as missing. This is a data cleaning step to remove completely empty rows inside the selected range (rows 6 to 25), which is why row label 7 is absent from the output.

`df_ncr.columns = ['Province, City, and Municipality', 'Population']`
* `df_ncr.columns = [...]` : This assigns the column names to the DataFrame. Since you used header=None initially, pandas assigned default numerical column indices. This line replaces those with your desired names: "Province, City, and Municipality" and "Population".

`df_ncr`
* This line displays the final DataFrame `df_ncr` after all the operations. The output you see is the DataFrame with your specified column names, containing population data for different cities and municipalities in the NCR.

**Summary**
* The code reads data from an Excel file without headers, selects a specific range of rows and columns, removes completely empty rows, assigns meaningful column names, and then displays the resulting DataFrame. The final DataFrame shows the population of different cities and municipalities in the NCR. The key here is understanding that `header=None` impacts how row and column selection works initially.
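
The difference between label-based `.loc` slicing (end included) and position-based `.iloc` slicing (end excluded) is easy to verify on a small frame. The sketch below uses a hand-built stand-in for a sheet read with `header=None`; its layout is an assumption, and only the two population figures come from the output above.

```python
import pandas as pd

# Hand-built stand-in for a sheet read with header=None:
# columns and rows are labeled 0, 1, 2, ...
raw = pd.DataFrame(
    [
        [None, "Total Population (title row)", None],
        [None, None, None],
        [None, "NATIONAL CAPITAL REGION", 13484462],
        [None, None, None],
        [None, "CITY OF MANILA", 1846513],
        [None, "Source:", None],
    ]
)

print(len(raw.loc[2:4]))   # label slice: rows 2, 3 and 4
print(len(raw.iloc[2:4]))  # position slice: rows 2 and 3 only

# Select the data rows and drop the first column, then remove empty rows.
df_demo = raw.loc[2:4, 1:].dropna(how="all")
df_demo.columns = ["Province, City, and Municipality", "Population"]

print(df_demo)
```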

In [26]:
df_ncr = pd.read_excel('/mnt/data/public/census/2020/NCR.xlsx', header = None)
df_ncr = df_ncr.loc[6:25, 1:]
df_ncr = df_ncr.dropna(how='all')
df_ncr.columns = ['Province, City, and Municipality', 'Population']
df_ncr


Unnamed: 0,"Province, City, and Municipality",Population
6,NATIONAL CAPITAL REGION,13484462
8,CITY OF MANILA,1846513
9,CITY OF MANDALUYONG,425758
10,CITY OF MARIKINA,456059
11,CITY OF PASIG,803159
12,QUEZON CITY,2960048
13,CITY OF SAN JUAN,126347
14,CITY OF CALOOCAN,1661584
15,CITY OF MALABON,380522
16,CITY OF NAVOTAS,247543


<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

`pd.read_excel('/mnt/data/public/census/2020/NCR.xlsx', sheet_name='NCR by barangay')`
* `pd.read_excel(...)`: This function reads data from an Excel file.
* `'/mnt/data/public/census/2020/NCR.xlsx'`: This is the file path to the Excel file you're reading.
* `sheet_name='NCR by barangay'`: This specifies that you want to read data from the sheet named "NCR by barangay" within the Excel file. Excel files can contain multiple sheets, so this argument is essential to select the correct one.

**Summary**
* The code reads data from a specific sheet ("NCR by barangay") of an Excel file using pandas. The resulting DataFrame has four columns, but the column names are generic ("Unnamed: 0," etc.), and the data in the location column is a mix of actual location names and descriptive text. This suggests that further data cleaning and processing will be necessary to make this data usable for analysis.  The `NaN` values in the other columns indicate that those columns likely don't contain relevant data for most rows in this particular sheet.

In [28]:
pd.read_excel('/mnt/data/public/census/2020/NCR.xlsx', sheet_name='NCR by barangay')

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,"Total Population by Province, City, and Municipality:",Unnamed: 3
0,,,"as of May 1, 2020",
1,,,,
2,,,"Province, City, Municipality,",Total
3,,,and Barangay,Population
4,,,,
...,...,...,...,...
1786,,,3 Created into a barangay under Republic Act N...,
1787,,,"taken from Barangay Tanza, City of Navotas.",
1788,,,,
1789,,,Source:,


<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

`pd.read_excel('/mnt/data/public/census/2020/NCR.xlsx', sheet_name=None)`
* `pd.read_excel(...)`: This is a pandas function that reads data from an Excel file and creates a DataFrame (a table-like data structure).

* `'/mnt/data/public/census/2020/NCR.xlsx'`: This is the file path to the Excel file. It specifies the location of the file on your system.

* `sheet_name=None`: This is the key part. By setting `sheet_name=None`, you're instructing pandas to read all sheets from the Excel file. The default behavior of `pd.read_excel()` is to read only the first sheet.

**Summary**
* The code reads every sheet of the Excel file at once. Instead of a single DataFrame, `pd.read_excel()` returns a dictionary that maps each sheet name (for example `'NCR by city & mun'`) to its DataFrame, so an individual sheet is retrieved by its key.

In [29]:
pd.read_excel('/mnt/data/public/census/2020/NCR.xlsx', sheet_name=None)

{'NCR by city & mun':     Unnamed: 0 Total Population by Province, City, and Municipality:  \
 0          NaN                                  as of May 1, 2020      
 1          NaN                                                NaN      
 2          NaN                  Province, City, and Municipality       
 3          NaN                                       and Barangay      
 4          NaN                                                NaN      
 5          NaN                            NATIONAL CAPITAL REGION      
 6          NaN                                                NaN      
 7          NaN                                     CITY OF MANILA      
 8          NaN                                CITY OF MANDALUYONG      
 9          NaN                                   CITY OF MARIKINA      
 10         NaN                                      CITY OF PASIG      
 11         NaN                                        QUEZON CITY      
 12         NaN               

In [30]:
df_ncr.to_excel('/mnt/data/public/census/2020/NCR.xlsx')

PermissionError: [Errno 13] Permission denied: '/mnt/data/public/census/2020/NCR.xlsx'

In [None]:
<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

`with pd.ExcelWriter(...) as writer:`
* `pd.ExcelWriter` opens one workbook for writing so that several DataFrames can be saved to separate sheets of the same file, here `ncr1` and `ncr2`.
* Both this cell and the `df_ncr.to_excel(...)` call above it fail with `PermissionError: [Errno 13]` because the current user is not allowed to write to `/mnt/data/public/census/2020/NCR.xlsx`. Pointing the output path at a directory you own avoids the error.


In [31]:
with pd.ExcelWriter("/mnt/data/public/census/2020/NCR.xlsx") as writer:
    df_ncr.to_excel(writer, sheet_name='ncr1')
    df_ncr.to_excel(writer, sheet_name='ncr2')

PermissionError: [Errno 13] Permission denied: '/mnt/data/public/census/2020/NCR.xlsx'
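
A version that writes to a location the user owns should succeed. The following is a minimal sketch, assuming an Excel engine such as `openpyxl` is installed; the temporary directory, the file name `ncr_copy.xlsx`, and the small `df_demo` frame are hypothetical, with population figures taken from the output earlier in this section.

```python
import os
import tempfile

import pandas as pd

df_demo = pd.DataFrame(
    {
        "Province, City, and Municipality": ["CITY OF MANILA", "QUEZON CITY"],
        "Population": [1846513, 2960048],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # A path inside a directory we are allowed to write to.
    out_path = os.path.join(tmp_dir, "ncr_copy.xlsx")

    # Write the same DataFrame to two sheets of one workbook.
    with pd.ExcelWriter(out_path) as writer:
        df_demo.to_excel(writer, sheet_name="ncr1", index=False)
        df_demo.to_excel(writer, sheet_name="ncr2", index=False)

    # sheet_name=None reads all sheets into a dict keyed by sheet name.
    sheets = pd.read_excel(out_path, sheet_name=None)

print(list(sheets))
print(sheets["ncr1"])
```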