## Geospatial Unit Data Cleanning for Measuring Regional Economic Inequality in China

- Date:Feb 12 2024
- Author: Xiaozhong Sun
- Abstract: The main goal of this Jupyter Notebook is to overcome the MAUP at disaggregated geospatial unit of China from 1990 to 2021, which is to identify the spatial relationship between geospatial unit at prefecture-level. 

### Technical Set-ups

**Command in Terminal**
- conda create -n geeineq python=3.11
- conda activate geeineq
- conda install -c conda-forge mamba
- mamba install -c conda-forge pygis

**patchly change the name of data files**
`for file in 市*.geojson; do
  mv "$file" "${file/市/city-level}"
done`

### Import Packages

In [29]:
import ee
import geemap
import os
import glob
import geopandas as gpd
import pandas as pd
import numpy as np

In [32]:
ee.Authenticate() 

True

In [33]:
geemap.ee_initialize()

### Data Preparation

#### Step 1: Reading GEOJson Files

In [5]:
# This step loads all the city-level GeoJSON data from the GeoJSON files into a dictionary of GeoDataFrames
# Path to the directory containing my GEOJson files
data_directory1 = "./data/raw_data/city_geojson"

# Dictionary to hold my data, with years as keys
data_by_year = {}

for file_name in os.listdir(data_directory1):
    if file_name.startswith('city-level') and file_name.endswith('.geojson'):
        # Extract the year from the file name 
        # (assuming the format "city-levelYEAR.geojson")
        # This slices the last 12 characters from the filename, 
        # then takes the first 4 as the year
        year = int(file_name[-12:-8])
        file_path = os.path.join(data_directory1, file_name)
        print(f"Processing file: {file_path}")
        data_by_year[year] = gpd.read_file(file_path)
        print(f"Data loaded for year {year}")



Processing file: ./data/raw_data/city_geojson/city-level1996.geojson
Data loaded for year 1996
Processing file: ./data/raw_data/city_geojson/city-level2003.geojson
Data loaded for year 2003
Processing file: ./data/raw_data/city_geojson/city-level2013.geojson
Data loaded for year 2013
Processing file: ./data/raw_data/city_geojson/city-level1997.geojson
Data loaded for year 1997
Processing file: ./data/raw_data/city_geojson/city-level2012.geojson
Data loaded for year 2012
Processing file: ./data/raw_data/city_geojson/city-level2002.geojson
Data loaded for year 2002
Processing file: ./data/raw_data/city_geojson/city-level2000.geojson
Data loaded for year 2000
Processing file: ./data/raw_data/city_geojson/city-level2010.geojson
Data loaded for year 2010
Processing file: ./data/raw_data/city_geojson/city-level1995.geojson
Data loaded for year 1995
Processing file: ./data/raw_data/city_geojson/city-level2019.geojson
Data loaded for year 2019
Processing file: ./data/raw_data/city_geojson/city

##### 1. Print the first few rows of each GeoDataFrame to get a quick overview of the data:

In [6]:
# These following steps are to inspect the data
# Print the first few rows of each GeoDataFrame to get a quick overview of the data:
for year, data in data_by_year.items():
    print(f"Data for year {year}:")
    print(data.head())
    print("\n")

Data for year 1996:
     省     省代码      市     市代码  \
0  北京市  110000  北京市辖区  110100   
1  北京市  110000  北京市辖县  110200   
2  天津市  120000  天津市辖区  120100   
3  天津市  120000  天津市辖县  120200   
4  河北省  130000   石家庄市  130100   

                                            geometry  
0  MULTIPOLYGON (((115.95390 40.08780, 115.94240 ...  
1  MULTIPOLYGON (((116.66140 41.03630, 116.64820 ...  
2  MULTIPOLYGON (((118.05240 39.29560, 117.99580 ...  
3  MULTIPOLYGON (((117.15690 38.74750, 117.17020 ...  
4  MULTIPOLYGON (((113.84270 38.76480, 113.84310 ...  


Data for year 2003:
     省     省代码      市     市代码  \
0  北京市  110000  北京市辖区  110100   
1  北京市  110000  北京市辖县  110200   
2  天津市  120000  天津市辖区  120100   
3  天津市  120000  天津市辖县  120200   
4  河北省  130000   石家庄市  130100   

                                            geometry  
0  MULTIPOLYGON (((116.66140 41.03630, 116.64820 ...  
1  MULTIPOLYGON (((116.45630 40.77910, 116.44340 ...  
2  MULTIPOLYGON (((117.19980 39.83410, 117.17100 ...  
3  MULTIPO

##### 2. List all columns in the GeoDataFrames:

In [7]:
# List all columns in the GeoDataFrames, and get prepared to rename the columns:
for year, data in data_by_year.items():
    print(f"Columns for year {year}:")
    print(data.columns)
    print("\n")

Columns for year 1996:
Index(['省', '省代码', '市', '市代码', 'geometry'], dtype='object')


Columns for year 2003:
Index(['省', '省代码', '市', '市代码', 'geometry'], dtype='object')


Columns for year 2013:
Index(['省', '省代码', '市', '市代码', 'geometry'], dtype='object')


Columns for year 1997:
Index(['省', '省代码', '市', '市代码', 'geometry'], dtype='object')


Columns for year 2012:
Index(['省', '省代码', '市', '市代码', 'geometry'], dtype='object')


Columns for year 2002:
Index(['省', '省代码', '市', '市代码', 'geometry'], dtype='object')


Columns for year 2000:
Index(['省', '省代码', '市', '市代码', 'geometry'], dtype='object')


Columns for year 2010:
Index(['省', '省代码', '市', '市代码', 'geometry'], dtype='object')


Columns for year 1995:
Index(['省', '省代码', '市', '市代码', 'geometry'], dtype='object')


Columns for year 2019:
Index(['省代码', '省', '市代码', '市', '类型', 'geometry'], dtype='object')


Columns for year 2009:
Index(['省', '省代码', '市', '市代码', 'geometry'], dtype='object')


Columns for year 2008:
Index(['省', '省代码', '市', '市代码', 'geom

##### 3. List all columns in the GeoDataFrames for year 2020 and 2021:

In [8]:
# List all columns in the GeoDataFrames for year 2020 and 2021 since I notice they are different from the rest of the years:
for year, data in data_by_year.items():
    if year in [2019, 2020, 2021]:
        print(f"Columns for year {year}:")
        print(data.columns)
        print("\n")

Columns for year 2019:
Index(['省代码', '省', '市代码', '市', '类型', 'geometry'], dtype='object')


Columns for year 2021:
Index(['省', '省代码', '省类型', '市', '市代码', '市类型', 'geometry'], dtype='object')


Columns for year 2020:
Index(['省', '省代码', '省类型', '市', '市代码', '市类型', 'geometry'], dtype='object')




##### 4. Rename columns to English

In [9]:
# Rename the columns in the GeoDataFrames for year 1990-2019 省 to province, 市 to city, 省代码 to province_code, 市代码 to city_code.
for year, data in data_by_year.items():
    if year in range(1990, 2020):
        data.rename(columns={'省': 'province', '市': 'city', '省代码': 'province_code', '市代码': 'city_code'}, inplace=True)
        print(f"Columns for year {year}:")
        print(data.columns)
        print("\n")

Columns for year 1996:
Index(['province', 'province_code', 'city', 'city_code', 'geometry'], dtype='object')


Columns for year 2003:
Index(['province', 'province_code', 'city', 'city_code', 'geometry'], dtype='object')


Columns for year 2013:
Index(['province', 'province_code', 'city', 'city_code', 'geometry'], dtype='object')


Columns for year 1997:
Index(['province', 'province_code', 'city', 'city_code', 'geometry'], dtype='object')


Columns for year 2012:
Index(['province', 'province_code', 'city', 'city_code', 'geometry'], dtype='object')


Columns for year 2002:
Index(['province', 'province_code', 'city', 'city_code', 'geometry'], dtype='object')


Columns for year 2000:
Index(['province', 'province_code', 'city', 'city_code', 'geometry'], dtype='object')


Columns for year 2010:
Index(['province', 'province_code', 'city', 'city_code', 'geometry'], dtype='object')


Columns for year 1995:
Index(['province', 'province_code', 'city', 'city_code', 'geometry'], dtype='object')


C

In [10]:
# Rename the columns in the GeoDataFrames for year 2020 and 2021 since they have different column names.
for year, data in data_by_year.items():
    if year in [2020, 2021]:
        data.rename(columns={'省': 'province', '省类型': 'province_level', 
                             '市': 'city', '市类型': 'city_level', '省代码': 'province_code', 
                             '市代码': 'city_code'}, inplace=True)
        print(f"Columns for year {year}:")
        print(data.columns)
        print("\n")

Columns for year 2021:
Index(['province', 'province_code', 'province_level', 'city', 'city_code',
       'city_level', 'geometry'],
      dtype='object')


Columns for year 2020:
Index(['province', 'province_code', 'province_level', 'city', 'city_code',
       'city_level', 'geometry'],
      dtype='object')




##### 5. Generate sepreated GEOJSON and csv files by province + year

In [11]:
# This step generates the GeoJSON and CSV files for each year and province for potential later use.
# Define the years to include
years = list(range(1990, 2022))

# Get the unique provinces
provinces = set(province for gdf in data_by_year.values() for province in gdf['province'].unique())

# Create a directory to store the CSV files and seperate geojson files
data_directory2 = "./data/working_data"

# Loop through all provinces and generate the GeoJSON, CSV, and combined CSV files
for province in provinces:
    # Initialize a list to store the DataFrames for all years
    dfs = []

    # Loop through all years
    for year in years:
        # Check if the year exists in the data
        if year in data_by_year:
            # Filter the data for the current province and year
            gdf = data_by_year[year][data_by_year[year]['province'] == province]
            
            # Check if the DataFrame is not empty
            if not gdf.empty:
                # Add a new column 'year' to the GeoDataFrame
                gdf.loc[:, 'year'] = year

                # Define the file path for the GeoJSON file
                file_path_geojson = os.path.join(data_directory2, 'geojson_by_province', f'{province}{year}.geojson')

                # Export the GeoDataFrame to a GeoJSON file
                gdf.to_file(file_path_geojson, driver='GeoJSON')

                # Define the file path for the CSV file
                file_path_csv = os.path.join(data_directory2, 'csv_by_province', f'{province}{year}.csv')

                # Export the GeoDataFrame to a CSV file
                gdf.to_csv(file_path_csv, index=False)

                # Add the DataFrame to the list
                dfs.append(gdf)

    # Concatenate the DataFrames for all years
    df_combined = pd.concat(dfs)

    # Define the file path for the combined CSV file
    file_path_csv_combined = os.path.join(data_directory2, 'csv_by_province', f'{province}_combined.csv')

    # Export the combined DataFrame to a CSV file
    df_combined.to_csv(file_path_csv_combined, index=False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  super().__setitem__(key, value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  super().__setitem__(key, value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  super().__setitem__(key, value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = 

#### 6. Import only one province to inspect their boundaries change over time.

In [17]:
# Based on the previous steps, I can now inspect the data and merge the GeoDataFrames for the provinces of interest for their spatial boundaries.
# Define the provinces
provinces = ['西藏自治区', '四川省', '新疆维吾尔自治区', '云南省']

# Initialize an empty list to store the GeoDataFrames
gdfs = []

# Loop through all the provinces
for province in provinces:
    # Get a list of all the GeoJSON files for the province
    file_paths = glob.glob(os.path.join(data_directory2, 'geojson_by_province', f'{province}*.geojson'))

    # Loop through all the file paths and read each file
    for file_path in file_paths:
        gdf = gpd.read_file(file_path)
        gdfs.append(gdf)

# Concatenate all the GeoDataFrames into a single GeoDataFrame
gdf_merged = pd.concat(gdfs, ignore_index=True)

In [34]:
# Visualize the data using geemap and visually inspect their boundaries.
# Create a Map instance
Map = geemap.Map()

# Define the years to include and their corresponding colors
years = [1992, 2000, 2016]
colors = ['red', 'green', 'blue']

# Loop through the selected years and add each GeoDataFrame as a layer
for year, color in zip(years, colors):
    # Filter the GeoDataFrame for the current year
    gdf = gdf_merged[gdf_merged['year'] == year]
    
    style = {
    "stroke": True,
    "color": color,  # change the color for each year
    "weight": 2,
    "opacity": 1,
    "fill": True,
    "fillColor": color,  # change the fill color for each year
    "fillOpacity": 0.1,
    }

    hover_style = {"fillOpacity": 0.6}

    Map.add_gdf(gdf, layer_name=f'City boundaries in {year}', style=style, hover_style=hover_style)

Map

AttributeError: type object 'Reducer' has no attribute 'mean'

#### Step 2: Import and Clean City-level Census Datasets and Prepare for a Table Join with the 2021 Geojson File

In [87]:
# Define the path to the directory
path = "./data/raw_data/census_data_by_city"

# Get a list of all Excel files in the directory
all_files = glob.glob(os.path.join(path, "*.xlsx"))

# Prepare a list to hold all dataframes
all_dfs = []

# Iterate over all files
for file in all_files:
    # Read the Excel file
    df = pd.read_excel(file)
    
    # Drop the first row which seems to be a merged header and does not contribute to data structure understanding
    df = df.drop(index=0).reset_index(drop=True)

    # Identify columns for each year and corresponding indicators
    years = df.columns[1::7]  # Starting from the second column, every 7th column represents a new year

    # Prepare a list to hold the transformed data for each year
    transformed_data_list = []

    # Iterate over each year, select relevant columns for that year, and transform
    for year in years:
        # The year column index
        year_index = df.columns.get_loc(year)
        
        # Select columns for the current year
        year_data = df.iloc[:, [0, year_index, year_index+1, year_index+2, year_index+3, year_index+4, year_index+5, year_index+6]]
        
        # Rename columns to reflect the indicators
        year_data.columns = ['city', 'GDP(100million)', 'VA_primary(100million)', 'VA_secondary(100million)', 'VA_tertiary(100million)', 'total_pop(10k)', 'resident_pop(10k)', 'CPI']
        
        # Add a year column
        year_data.insert(1, 'year', year.rstrip('年'))
        
        # Append to list
        transformed_data_list.append(year_data)

    # Concatenate all yearly data into a single DataFrame
    transformed_df = pd.concat(transformed_data_list, ignore_index=True)
    
    # Append the transformed DataFrame to the list of all dataframes
    all_dfs.append(transformed_df)

# Concatenate all dataframes into a single DataFrame
final_df = pd.concat(all_dfs, ignore_index=True)

In [88]:
# Replace '--' with NaN
final_df.replace('--', np.nan, inplace=True)

# Convert 'year' to integer
final_df['year'] = final_df['year'].astype(int)

numeric_columns = ['GDP(100million)', 'VA_primary(100million)', 'VA_secondary(100million)', 'VA_tertiary(100million)', 'total_pop(10k)', 'resident_pop(10k)', 'CPI']

for col in numeric_columns:
      final_df[col] = final_df[col].astype(float)

# Check the data types again
print(final_df.dtypes)

city                         object
year                          int64
GDP(100million)             float64
VA_primary(100million)      float64
VA_secondary(100million)    float64
VA_tertiary(100million)     float64
total_pop(10k)              float64
resident_pop(10k)           float64
CPI                         float64
dtype: object


In [89]:
# Filter for years 1990 to 2021
final_df = final_df[(final_df['year'] >= 1990) & (final_df['year'] <= 2021)]

# Separate data for '莱芜市' and '济南市'
laiwu_df = final_df[final_df['city'] == '莱芜市'].set_index('year')
jinan_df = final_df[final_df['city'] == '济南市'].set_index('year')

# Add the values of '莱芜市' to '济南市' for each column and each year, treating NaN as 0
summed_df = laiwu_df.add(jinan_df, fill_value=0)

# Set the city name to '济南市'
summed_df['city'] = '济南市'

# Remove '莱芜市' and '济南市' from the original dataframe
final_df = final_df[(final_df['city'] != '莱芜市') & (final_df['city'] != '济南市')]

# Append the summed dataframe to the original dataframe
final_df = pd.concat([final_df, summed_df.reset_index()])

# Sort by 'city' and 'year'
final_df = final_df.sort_values(['city', 'year'])

# Reset the index
final_df.reset_index(drop=True, inplace=True)

In [90]:
# List of province regions
province_regions = ['辽宁省', '吉林省', '黑龙江', '贵州省', '云南省', '河北省', '内蒙古', '安徽省', '山东省', '山西省', '四川省', '广东省', '广西', '新疆', '江苏省', '浙江省', '江西省', '河南省', '海南省', '湖北省', '湖南省', '福建省', '西藏', '陕西省', '甘肃省', '宁夏', '青海省']

# Filter rows where 'city' is in the list of province regions and rename the 'city' column to 'province'
province_df = final_df[final_df['city'].isin(province_regions)]
province_df.rename(columns={'city': 'province'}, inplace=True)

# Remove these rows from the original DataFrame
final_df = final_df[~final_df['city'].isin(province_regions)]

# Reset the indices
final_df.reset_index(drop=True, inplace=True)
province_df.reset_index(drop=True, inplace=True)

In [91]:
# Define the columns to check for missing values for province_df
columns_to_check = ['GDP(100million)', 'VA_primary(100million)', 'VA_secondary(100million)', 'VA_tertiary(100million)', 'total_pop(10k)', 'resident_pop(10k)', 'CPI']

# Find rows where any of the specified columns are missing
missing_values = province_df[province_df[columns_to_check].isna().any(axis=1)]

# For each row, find the names of the columns that are missing
missing_columns = missing_values[columns_to_check].isna().apply(lambda x: list(x.index[x]), axis=1)

# Create a new DataFrame that includes the 'province', 'year', and the names of the columns that are missing
missing_values_df = pd.DataFrame({
    'province': missing_values['province'],
    'year': missing_values['year'],
    'missing_columns': missing_columns
})

# Export to CSV
missing_values_df.to_csv('missing_values.csv', index=False)

In [92]:
# Perform an outer join of final_df with data_by_year[2021] without the 'CPI' column
outer_joined_df = final_df.drop(columns='CPI').merge(data_by_year[2021], on='city', how='outer', indicator=True)

# Find out which rows were not joined
not_joined_final_df = outer_joined_df[outer_joined_df['_merge'] == 'left_only']
not_joined_data_by_year_2021 = outer_joined_df[outer_joined_df['_merge'] == 'right_only']

# Print the rows that were not joined
print("Rows in final_df that were not joined:\n", not_joined_final_df)
print("Rows in data_by_year[2021] that were not joined:\n", not_joined_data_by_year_2021)

# Export the dataframes to CSV
not_joined_final_df.to_csv('not_joined_final_df.csv', index=False)
not_joined_data_by_year_2021.to_csv('not_joined_data_by_year_2021.csv', index=False)

# Perform an inner join of final_df with data_by_year[2021] without the 'CPI' column
joined_df = final_df.drop(columns='CPI').merge(data_by_year[2021], on='city', how='inner')

Rows in final_df that were not joined:
                  city    year  GDP(100million)  VA_primary(100million)  \
8256   省直辖县级行政区划(河南省)  1990.0              NaN                     NaN   
8257   省直辖县级行政区划(河南省)  1991.0              NaN                     NaN   
8258   省直辖县级行政区划(河南省)  1992.0              NaN                     NaN   
8259   省直辖县级行政区划(河南省)  1993.0              NaN                     NaN   
8260   省直辖县级行政区划(河南省)  1994.0              NaN                     NaN   
...               ...     ...              ...                     ...   
8827  自治区直辖县级行政区划(新疆)  2017.0              NaN                     NaN   
8828  自治区直辖县级行政区划(新疆)  2018.0              NaN                     NaN   
8829  自治区直辖县级行政区划(新疆)  2019.0              NaN                     NaN   
8830  自治区直辖县级行政区划(新疆)  2020.0              NaN                     NaN   
8831  自治区直辖县级行政区划(新疆)  2021.0              NaN                     NaN   

      VA_secondary(100million)  VA_tertiary(100million)  total_pop(10k)

In [93]:
# List of columns to impute
columns_to_impute = ['GDP(100million)', 'VA_primary(100million)', 'VA_secondary(100million)', 'VA_tertiary(100million)']

# Sort the DataFrame by 'year' to ensure that the data is in the correct order
joined_df = joined_df.sort_values('year')

# Create a copy of the DataFrame to avoid modifying the original data
df_imputed = joined_df.copy()

# Assuming 'joined_df' is already sorted by 'year'

for column in columns_to_impute:
    for i in joined_df.index:
        # Ensure we're not at the boundaries of the DataFrame
        if i > joined_df.index[0] and i < joined_df.index[-1]:
            # Check if the current value is NaN
            if pd.isna(joined_df.at[i, column]):
                # Get index positions for previous and next rows
                prev_index = i - 1
                next_index = i + 1
                
                # Check if both previous and next values are non-NaN
                if not pd.isna(joined_df.at[prev_index, column]) and not pd.isna(joined_df.at[next_index, column]):
                    # Impute the missing value
                    df_imputed.at[i, column] = (joined_df.at[prev_index, column] + joined_df.at[next_index, column]) / 2


# Check if there are still any missing values in the imputed columns
print(df_imputed[columns_to_impute].isnull().sum())

df_imputed.head()


GDP(100million)             602
VA_primary(100million)      658
VA_secondary(100million)    695
VA_tertiary(100million)     733
dtype: int64


Unnamed: 0,city,year,GDP(100million),VA_primary(100million),VA_secondary(100million),VA_tertiary(100million),total_pop(10k),resident_pop(10k),province,province_code,province_level,city_code,city_level,geometry
0,七台河市,1990,12.442,2.43,6.392,3.6204,77.2,,黑龙江省,230000.0,省,230900.0,地级市,"MULTIPOLYGON (((131.49808 46.24715, 131.48547 ..."
10912,阿坝藏族羌族自治州,1990,11.09,,,,77.5703,,四川省,510000.0,省,513200.0,自治州,"MULTIPOLYGON (((102.91143 34.31419, 102.91420 ..."
2464,咸宁市,1990,26.08,12.98,8.16,4.94,242.84,,湖北省,420000.0,省,421200.0,地级市,"MULTIPOLYGON (((114.04152 30.21689, 114.03973 ..."
8224,盘锦市,1990,45.6,6.9859,28.6279,9.9754,104.9,,辽宁省,210000.0,省,211100.0,地级市,"MULTIPOLYGON (((122.18755 41.44637, 122.19244 ..."
4992,攀枝花市,1990,21.6489,2.3885,13.4018,5.8586,90.85,,四川省,510000.0,省,510400.0,地级市,"MULTIPOLYGON (((101.63373 27.34194, 101.63269 ..."


In [94]:
# Import and clean supplementary data for missing values
## Yunnan
supple_data_path = "./data/raw_data/missing_value"

yunnan_df = pd.read_csv(os.path.join(supple_data_path, '云南补充.csv'))

yunnan_df.head()

# Translate column names
translated_columns = {
    '指标': 'Indicator',
    '地区': 'city',
    '时间': 'year',
    '数值': 'Value'
}
yunnan_df.rename(columns=translated_columns, inplace=True)

# Check for unique indicators to understand how to pivot the data
unique_indicators = yunnan_df['Indicator'].unique()

unique_indicators

# Remove non-indicator rows (filter out rows where 'Year' is NaN or 'Indicator' contains non-indicator text)
clean_df = yunnan_df.dropna(subset=['year']).copy()

# Pivot the DataFrame
yunnan_df = clean_df.pivot_table(index=['city', 'year'], columns='Indicator', values='Value', aggfunc='first').reset_index()

# Rename columns to match the required structure
yunnan_df.columns = ['city', 'year', 'GDP(100million)', 'VA_primary(100million)', 'VA_secondary(100million)', 'VA_tertiary(100million)']

yunnan_df.head(200)



FileNotFoundError: [Errno 2] No such file or directory: './data/raw_data/missing_value/云南补充.csv'

In [95]:
# Import and clean supplementary data for missing values
supple_data_path = "./data/raw_data/missing_value"

supple_df = pd.read_csv(os.path.join(supple_data_path, 'supplement_missing_value'))

supple_df.head()

# Translate column names to English
supple_df.columns = ['Indicator', 'city', 'year', 'Value']

# Check unique values in 'Indicator' to understand how to restructure the dataset
unique_indicators = supple_df['Indicator'].unique()
unique_indicators

# Map the indicators to the desired column names and conversion factors
indicator_mapping = {
    '地区生产总值（亿元）': ('GDP(100million)', 1),
    '第一产业增加值（亿元）': ('VA_primary(100million)', 1),
    '第二产业增加值（亿元）': ('VA_secondary(100million)', 1),
    '第三产业增加值（亿元）': ('VA_tertiary(100million)', 1),
    '地区生产总值（万元）': ('GDP(100million)', 0.0001),
    '第一产业增加值（万元）': ('VA_primary(100million)', 0.0001),
    '第二产业增加值（万元）': ('VA_secondary(100million)', 0.0001),
    '第三产业增加值（万元）': ('VA_tertiary(100million)', 0.0001),
    '生产总值（按当年价格计算）（亿元）': ('GDP(100million)', 1) # Assuming it should be treated the same as GDP
}

# Apply the mapping to create new columns for Indicator and conversion factor
supple_df['Mapped_Indicator'] = supple_df['Indicator'].apply(lambda x: indicator_mapping[x][0])
supple_df['Conversion_Factor'] = supple_df['Indicator'].apply(lambda x: indicator_mapping[x][1])

# Convert values
supple_df['Converted_Value'] = supple_df['Value'] * supple_df['Conversion_Factor']

# Replace 0 values with NaN
supple_df['Converted_Value'].replace(0, np.nan, inplace=True)

# Pivot the table to have separate columns for each indicator
pivot_supdf = supple_df.pivot_table(values='Converted_Value', index=['city', 'year'], columns='Mapped_Indicator', aggfunc='sum').reset_index()

# Replace 0 values with NaN in the pivoted DataFrame
pivot_supdf.replace(0, np.nan, inplace=True)

pivot_supdf.head(1000)




Mapped_Indicator,city,year,GDP(100million),VA_primary(100million),VA_secondary(100million),VA_tertiary(100million)
0,万宁市,1990,4.77520,3.1961,0.49460,1.08450
1,万宁市,1991,5.21770,3.4005,0.61430,1.20290
2,万宁市,1992,6.55280,3.6314,1.05030,1.87110
3,万宁市,1993,10.47780,5.4174,2.28890,2.77150
4,万宁市,1994,15.63770,7.0934,4.40510,4.13920
...,...,...,...,...,...,...
995,百色市,2018,1176.77322,195.1754,563.48828,418.10954
996,百色市,2019,1257.78000,245.1758,508.46484,504.13677
997,百色市,2020,1333.73000,259.3700,531.11000,543.25000
998,神农架林区,1990,0.78000,,,


In [96]:
# Check if there are any remaining 0 values in the DataFrame
zero_values = (pivot_supdf == 0).any().any()

print("Are there any 0 values in the DataFrame?", zero_values)

Are there any 0 values in the DataFrame? False


In [97]:
# Ensure that the 'Year' columns in both dataframes are of the same data type
df_imputed['year'] = df_imputed['year'].astype(pivot_supdf['year'].dtype)

# List of columns to fill
columns_to_fill = ['GDP(100million)', 'VA_primary(100million)', 'VA_secondary(100million)', 'VA_tertiary(100million)']

for column in columns_to_fill:
    # Check if column exists in both dataframes
    if column in df_imputed.columns and column in pivot_supdf.columns:
        # Create a temporary dataframe with just the keys and the value you want to use for filling
        temp_df = pivot_supdf[['city', 'year', column]]

        # Merge the temporary dataframe with the main dataframe to align the missing values
        df_imputed = df_imputed.merge(temp_df, on=['city', 'year'], suffixes=('', '_supp'), how='left')

        # Fill NaN values in the main dataframe with the supplementary values
        df_imputed[column].update(df_imputed.pop(column + '_supp'))

In [98]:
# Reshape the DataFrame
stacked_df = df_imputed.drop(columns=['resident_pop(10k)', 'total_pop(10k)']).set_index(['city', 'year', 'province']).stack(dropna=False)

# Reset the index
stacked_df = stacked_df.reset_index()

# Rename the columns
stacked_df.columns = ['city', 'year', 'province', 'variable', 'value']

# Filter out the rows where the value is missing
missing_values_df = stacked_df[stacked_df['value'].isnull()]

# Drop the 'value' column as it's not needed
missing_values_df = missing_values_df.drop(columns='value')

# Export the dataframe to a CSV file
missing_values_df.to_csv('missing_values.csv', index=False)