In [94]:
import pandas as pd
file_path = "PRPL-Facebook-Government-Report-2021_H1.csv"
df = pd.read_csv(file_path)

**First, I imported the pandas library. Then I read the CSV file with pd.read_csv() and looked at the first few rows to understand the structure of the data.**

In [96]:
df.head()

Unnamed: 0,Country,Preservation Requests,Preservation Accounts Preserved,Total Requests,Number of Requests Where Some Data Produced,Total Requests Accounts,Total Requests Percentage,Legal Process Request Total,Legal Number of Requests Where Some Data Produced,Legal Process Request Total Accounts,...,Title III Percentage,FISA Content Requests,FISA Content Requests Accounts,FISA Content Requests Percentage,FISA Non-Content Requests,FISA Non-Content Requests Accounts,FISA Non-Content Requests Percentage,NSLs,NSLs Accounts,NSLs Percentage
0,Afghanistan,0,0,0,0,0,0%,0,0,0,...,,,,,,,,,,
1,Aland Islands,0,0,0,0,0,0%,0,0,0,...,,,,,,,,,,
2,Albania,81,117,24,17,63,71%,24,17,63,...,,,,,,,,,,
3,Algeria,0,0,0,0,0,0%,0,0,0,...,,,,,,,,,,
4,American Samoa,0,0,0,0,0,0%,0,0,0,...,,,,,,,,,,


**I noticed that some columns contained many NaN values, which means the data was missing. To see which columns had missing values, I counted the nulls in each column with isnull().sum().**

In [97]:
df.isnull().sum()

Country                                                0
Preservation Requests                                  0
Preservation Accounts Preserved                        0
Total Requests                                         0
Number of Requests Where Some Data Produced            0
Total Requests Accounts                                0
Total Requests Percentage                              0
Legal Process Request Total                            0
Legal Number of Requests Where Some Data Produced      0
Legal Process Request Total Accounts                   0
Legal Process Request Total Percentage                 0
Emergency Request Total                                0
ER Number of Requests Where Some Data Produced         0
Emergency Request Total Accounts                       0
Emergency Request Total Percentage                     0
Search Warrant                                       250
Search Warrant Accounts                              250
Search Warrant Percentage      
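
**The full isnull().sum() listing is long, so it can help to show only the columns that actually have gaps. The snippet below is a minimal sketch on a small stand-in DataFrame (the name `sample` and its three columns are placeholders, not the real report):**

```python
import numpy as np
import pandas as pd

# Small stand-in frame; the real report has many more columns.
sample = pd.DataFrame({
    "Country": ["Afghanistan", "Albania"],
    "Total Requests": [0, 24],
    "Search Warrant": [np.nan, np.nan],
})

missing_counts = sample.isnull().sum()

# Keep only the columns with at least one missing value.
columns_with_missing = missing_counts[missing_counts > 0]
print(columns_with_missing)
```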

In [125]:
df.dtypes

Country                                              object
Preservation Requests                                object
Preservation Accounts Preserved                      object
Total Requests                                       object
Number of Requests Where Some Data Produced          object
Total Requests Accounts                              object
Total Requests Percentage                            object
Legal Process Request Total                          object
Legal Number of Requests Where Some Data Produced    object
Legal Process Request Total Accounts                 object
Legal Process Request Total Percentage               object
Emergency Request Total                              object
ER Number of Requests Where Some Data Produced       object
Emergency Request Total Accounts                     object
Emergency Request Total Percentage                   object
Year                                                  int64
dtype: object

**The columns from "Search Warrant" through "NSLs Percentage" contain missing values, so I removed that whole range, including every column in between. These columns do not influence any of my analysis or visuals, so removing them made the data more compact and readable.**
**I took the syntax from the pandas documentation:** **https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html**
**After cleaning, I saved the modified DataFrame to a new CSV file.**

In [110]:
file_path = "PRPL-Facebook-Government-Report-2021_H1.csv"
df = pd.read_csv(file_path)

start_col = 'Search Warrant'
end_col = 'NSLs Percentage'
start_index = df.columns.get_loc(start_col)
end_index = df.columns.get_loc(end_col)
columns_to_drop = df.columns[start_index:end_index+1].tolist()
df = df.drop(columns=columns_to_drop)

output_file_path = "PRPL-Facebook-Government-Report-2021_H1_modified.csv"
df.to_csv(output_file_path, index=False)
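
**The same range of columns could also be selected by label instead of by position. This is only a sketch on a stand-in DataFrame (the `sample` frame and its "Title III" column stand in for the real middle columns); it relies on .loc label slices including both endpoints, and it assumes the column names are unique:**

```python
import pandas as pd

# Stand-in frame with the same first and last column names as the range to drop.
sample = pd.DataFrame({
    "Country": ["Albania"],
    "Total Requests": [24],
    "Search Warrant": [None],
    "Title III": [None],
    "NSLs Percentage": [None],
})

# A .loc label slice includes both the start and the end column.
columns_to_drop = sample.loc[:, "Search Warrant":"NSLs Percentage"].columns
trimmed = sample.drop(columns=columns_to_drop)
print(trimmed.columns.tolist())
```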

**Once I had the cleaned file, I re-imported the modified dataset and checked the remaining columns and the missing-value (NaN) counts to make sure no missing values were left.**

In [99]:
file_path = "PRPL-Facebook-Government-Report-2021_H1_modified.csv"
df = pd.read_csv(file_path)

In [100]:
df.columns

Index(['Country', 'Preservation Requests', 'Preservation Accounts Preserved',
       'Total Requests', 'Number of Requests Where Some Data Produced',
       'Total Requests Accounts', 'Total Requests Percentage',
       'Legal Process Request Total',
       'Legal Number of Requests Where Some Data Produced',
       'Legal Process Request Total Accounts',
       'Legal Process Request Total Percentage', 'Emergency Request Total',
       'ER Number of Requests Where Some Data Produced',
       'Emergency Request Total Accounts',
       'Emergency Request Total Percentage'],
      dtype='object')

In [101]:
df.isnull().sum()

Country                                              0
Preservation Requests                                0
Preservation Accounts Preserved                      0
Total Requests                                       0
Number of Requests Where Some Data Produced          0
Total Requests Accounts                              0
Total Requests Percentage                            0
Legal Process Request Total                          0
Legal Number of Requests Where Some Data Produced    0
Legal Process Request Total Accounts                 0
Legal Process Request Total Percentage               0
Emergency Request Total                              0
ER Number of Requests Where Some Data Produced       0
Emergency Request Total Accounts                     0
Emergency Request Total Percentage                   0
dtype: int64

**I repeated the same process for the remaining four years (2020, 2022, 2023, 2024) so that each dataset was cleaned and saved. I then imported the modified files for all five years and used the pd.concat() method to merge them into one DataFrame.**
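
**Repeating the cleaning by hand for each year could also be written as a loop. The sketch below is not what I ran; the helper name `drop_column_range` is made up for illustration, and it assumes every yearly file follows the same naming pattern and has the same "Search Warrant" to "NSLs Percentage" columns. Files that are not in the working directory are skipped.**

```python
from pathlib import Path

import pandas as pd


def drop_column_range(df, start_col="Search Warrant", end_col="NSLs Percentage"):
    """Return a copy of df without the columns from start_col through end_col."""
    start = df.columns.get_loc(start_col)
    end = df.columns.get_loc(end_col)
    return df.drop(columns=df.columns[start:end + 1])


for year in [2020, 2021, 2022, 2023, 2024]:
    source = Path(f"PRPL-Facebook-Government-Report-{year}_H1.csv")
    if not source.exists():
        continue  # skip years whose file is not available locally
    cleaned = drop_column_range(pd.read_csv(source))
    # Writes e.g. PRPL-Facebook-Government-Report-2020_H1_modified.csv
    cleaned.to_csv(source.with_name(f"{source.stem}_modified.csv"), index=False)
```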

In [108]:
df_1 = pd.read_csv("PRPL-Facebook-Government-Report-2020_H1_modified.csv")
df_2 = pd.read_csv("PRPL-Facebook-Government-Report-2021_H1_modified.csv")
df_3 = pd.read_csv("PRPL-Facebook-Government-Report-2022_H1_modified.csv")
df_4 = pd.read_csv("PRPL-Facebook-Government-Report-2023_H1_modified.csv")
df_5 = pd.read_csv("PRPL-Facebook-Government-Report-2024_H1_modified.csv")

In [111]:
combined_df = pd.concat([df_1, df_2, df_3, df_4, df_5], ignore_index=True)
combined_df.head(100)

Unnamed: 0,Country,Preservation Requests,Preservation Accounts Preserved,Total Requests,Number of Requests Where Some Data Produced,Total Requests Accounts,Total Requests Percentage,Legal Process Request Total,Legal Number of Requests Where Some Data Produced,Legal Process Request Total Accounts,Legal Process Request Total Percentage,Emergency Request Total,ER Number of Requests Where Some Data Produced,Emergency Request Total Accounts,Emergency Request Total Percentage
0,Afghanistan,0,0,0,0,0,0%,0,0,0,0%,0,0,0,0%
1,Aland Islands,0,0,0,0,0,0%,0,0,0,0%,0,0,0,0%
2,Albania,9,9,11,4,24,36%,10,3,23,30%,1,1,1,100%
3,Algeria,1,1,2,0,3,0%,1,0,2,0%,1,0,1,0%
4,American Samoa,0,0,0,0,0,0%,0,0,0,0%,0,0,0,0%
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,Honduras,21,48,9,4,23,44%,3,1,13,33%,6,3,10,50%
96,Hong Kong,47,50,262,62,285,24%,259,61,282,24%,3,1,3,33%
97,Hungary,24,26,409,203,593,50%,406,203,587,50%,3,0,6,0%
98,Iceland,13,15,1,1,1,100%,1,1,1,100%,0,0,0,0%


**Next, I needed to add a "Year" column to each dataset. At first I was unsure how to do this and how to merge the datasets afterward, so I searched Google for ways to add a year column and extract the year from each filename. The search turned up a couple of options, such as split() and re.search(), and I felt most comfortable with the split() method. I then wrote a function called extract_year() that splits the filename on hyphens, selects the fifth part (index 4, which begins with the year), and converts its first four characters to an integer. I applied this function to each dataset (2020 to 2024) to add its year as a new column in each DataFrame.**

In [118]:
def extract_year(filename): 
    year_part = filename.split('-')[4]
    return int(year_part[:4])

df_1['Year'] = extract_year("PRPL-Facebook-Government-Report-2020_H1_modified.csv")
df_2['Year'] = extract_year("PRPL-Facebook-Government-Report-2021_H1_modified.csv")
df_3['Year'] = extract_year("PRPL-Facebook-Government-Report-2022_H1_modified.csv")
df_4['Year'] = extract_year("PRPL-Facebook-Government-Report-2023_H1_modified.csv")
df_5['Year'] = extract_year("PRPL-Facebook-Government-Report-2024_H1_modified.csv")
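
**For comparison, the re.search() option I did not use might look like the sketch below. The function name `extract_year_regex` and the pattern are my own assumptions; the pattern expects four digits followed by "_H" and a digit, as in the filenames above, and would need adjusting for other naming schemes.**

```python
import re


def extract_year_regex(filename):
    """Return the four-digit year that precedes '_H<digit>' in the filename."""
    match = re.search(r"(\d{4})_H\d", filename)
    if match is None:
        raise ValueError(f"No year found in {filename!r}")
    return int(match.group(1))


print(extract_year_regex("PRPL-Facebook-Government-Report-2021_H1_modified.csv"))
```

**Unlike split('-')[4], this version does not depend on how many hyphens come before the year.**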

**Finally, after adding the "Year" column to each dataset, I merged all five DataFrames into one using the pd.concat() function. I saved the combined data into a new file called "Combined_file_with_year.csv".**

In [119]:
combined_df = pd.concat([df_1, df_2, df_3, df_4, df_5], ignore_index=True)
combined_df.to_csv("Combined_file_with_year.csv", index=False)
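
**Adding the year and merging can also be done in one step, followed by a quick count of rows per year to confirm that every year made it into the combined data. This is a sketch on two tiny stand-in frames (the `frames_by_year` dictionary is a placeholder for the five real DataFrames), not the code I ran:**

```python
import pandas as pd

# Stand-in frames keyed by year; the real data has one frame per year from 2020 to 2024.
frames_by_year = {
    2020: pd.DataFrame({"Country": ["Albania", "Algeria"], "Total Requests": [11, 2]}),
    2021: pd.DataFrame({"Country": ["Albania", "Algeria"], "Total Requests": [24, 0]}),
}

# assign() returns a copy with the new Year column, so the original frames stay unchanged.
combined = pd.concat(
    [frame.assign(Year=year) for year, frame in frames_by_year.items()],
    ignore_index=True,
)

# Each year should contribute the number of rows in its source frame.
rows_per_year = combined.groupby("Year").size()
print(rows_per_year)
```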

In [116]:
file_path = "Combined_file_with_year.csv"
df = pd.read_csv(file_path)

In [117]:
df.head(300)

Unnamed: 0,Country,Preservation Requests,Preservation Accounts Preserved,Total Requests,Number of Requests Where Some Data Produced,Total Requests Accounts,Total Requests Percentage,Legal Process Request Total,Legal Number of Requests Where Some Data Produced,Legal Process Request Total Accounts,Legal Process Request Total Percentage,Emergency Request Total,ER Number of Requests Where Some Data Produced,Emergency Request Total Accounts,Emergency Request Total Percentage,Year
0,Afghanistan,0,0,0,0,0,0%,0,0,0,0%,0,0,0,0%,2020
1,Aland Islands,0,0,0,0,0,0%,0,0,0,0%,0,0,0,0%,2020
2,Albania,9,9,11,4,24,36%,10,3,23,30%,1,1,1,100%,2020
3,Algeria,1,1,2,0,3,0%,1,0,2,0%,1,0,1,0%,2020
4,American Samoa,0,0,0,0,0,0%,0,0,0,0%,0,0,0,0%,2020
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
295,Chile,416,865,348,52,756,15%,295,19,626,6%,53,33,130,62%,2021
296,China,0,0,0,0,0,0%,0,0,0,0%,0,0,0,0%,2021
297,Christmas Island,0,0,0,0,0,0%,0,0,0,0%,0,0,0,0%,2021
298,Cocos (Keeling) Islands,0,0,0,0,0,0%,0,0,0,0%,0,0,0,0%,2021


**Finding a dataset for my topic was challenging because the topic is sensitive. I searched several sources, including Kaggle, Google, and the official Meta (Facebook) website, which publishes datasets for different years (2019 to 2024). I used Meta's datasets. I also struggled to figure out how to merge these datasets into one, and I found help on Google: I learned how to use the split() method to extract the year from the filenames and add it as a new column. This made it easier to merge all the datasets and save them in one combined file.**

**Additionally, I noticed that most of the columns have the 'object' data type, and the percentage columns need to be converted to floats. I plan to make these data type changes in my next milestone.**
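
**A possible approach for that next step is sketched below on a stand-in DataFrame. It strips the "%" sign before converting to float. It also assumes the count columns are stored as text because of thousands separators such as "1,234"; that value is made up, and I have not yet confirmed that this is the reason for the 'object' type.**

```python
import pandas as pd

sample = pd.DataFrame({
    "Total Requests": ["24", "1,234"],           # "1,234" is a made-up value
    "Total Requests Percentage": ["71%", "0%"],
})

# Convert every percentage column from text like "71%" to the float 71.0.
percent_cols = [col for col in sample.columns if col.endswith("Percentage")]
for col in percent_cols:
    sample[col] = sample[col].str.rstrip("%").astype(float)

# Remove thousands separators (an assumption) before converting counts to numbers.
sample["Total Requests"] = pd.to_numeric(
    sample["Total Requests"].str.replace(",", "", regex=False)
)

print(sample.dtypes)
```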

**Citations**
*Dataset: https://transparency.meta.com/reports/government-data-requests/data-types/*
*split() method: https://www.youtube.com/watch?v=3zbK0dlRSBU*, *https://docs.python.org/3/library/stdtypes.html#str.split*