# Capstone Assignment 20.1: Initial Report and Exploratory Data Analysis (EDA)

Nathan Oyama

## 1 &emsp; Planning the project

This project uses four data sets.

* Kaggle: "Percent Sunshine by US City". 18 Jan 2023. kaggle.com/datasets/thedevastator/annual-percent-of-possible-sunshine-by-us-city.

* US Geological Survey: "The United States Large-Scale Solar Photovoltaic Database (USPVDB)". 28 Apr 2025. US Department of the Interior. energy.usgs.gov/uspvdb/data.

* landvalue: "ZHVI 3-Bedroom Time Series($) - City". 25 Jun 2025. landvalue.com/research/data.

* Pareto Software: "United States Cities Database - Basic". 9 Jun 2025. simplemaps.com/data/us-cities.

Take the following steps _for every data set_:

1. Load the data set, which is in CSV format, into a pandas DataFrame.
1. Analyze the DataFrame and identify which columns to use for this project.
1. Format the DataFrame before merging it with the others.

Finally, merge the four DataFrames into one.

## 2 &emsp; Analyzing Data Sets

Analyze the four data sets.

Import the required packages before working on the data sets:

In [1]:
import numpy as np
import pandas as pd

pd.options.mode.copy_on_write = True

from sklearn.model_selection import train_test_split, GridSearchCV

import warnings

warnings.filterwarnings("ignore", message=".*pkg_resources is deprecated as an API.*")
warnings.filterwarnings("ignore", category=UserWarning)

### 2.1 &emsp; Analyzing Data Set 1: "Percent Sunshine by US City"

In [2]:
df_sunshine_original = pd.read_csv(
    './data/Average Percent of Possible Sunshine by US City.csv'
    )

print(df_sunshine_original.head())

   index           CITY JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC ANN  \
0      0  BIRMINGHAM,AL  46  53  57  65  65  67  59  62  59  66  55  49  58   
1      1  MONTGOMERY,AL  47  55  58  64  63  64  61  61  59  63  55  49  58   
2      2   ANCHORAGE,AK  43  46  51  50  51  46  43  43  41  36  35  33  43   
3      3      JUNEAU,AK  39  35  38  42  44  37  33  35  27  21  26  21  33   
4      4        NOME,AK  38  56  54  52  52  43  39  34  38  35  30  36  42   

   Unnamed: 14  
0          NaN  
1          NaN  
2          NaN  
3          NaN  
4          NaN  


Look at the first few records of the original sunshine DataFrame. Two columns carry no information and can be discarded: "index" and "Unnamed: 14".

The remaining columns are the "CITY" column, one column for each of the 12 months ("JAN" through "DEC"), and the annual column "ANN". The "CITY" column holds the city name in uppercase, followed by a comma (",") and the state abbreviation. You can use this column as the index of this DataFrame. As the data set title indicates, each monthly column holds the average percentage of possible sunshine for that month, and "ANN" holds the corresponding annual figure. For example, Birmingham, Alabama receives approximately 46% of possible sunshine in January and approximately 58% over the whole year.
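
As a minimal sketch of how the "CITY" format described above could be taken apart, the following splits a few sample values on the last comma. The sample values are copied from the output above; the column names `city` and `state` are hypothetical and are not used elsewhere in this report.

```python
import pandas as pd

# Sample values copied from the output above.
cities = pd.Series(["BIRMINGHAM,AL", "MONTGOMERY,AL", "ANCHORAGE,AK"], name="CITY")

# Split once, from the right, so that only the final comma separates
# the city name from the state abbreviation.
parts = cities.str.rsplit(",", n=1, expand=True)
parts.columns = ["city", "state"]  # hypothetical column names

print(parts)
```

Splitting from the right is a defensive choice in case a city name itself ever contains a comma.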

In [3]:
df_sunshine_original.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 317 entries, 0 to 316
Data columns (total 16 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   index        317 non-null    int64  
 1   CITY         317 non-null    object 
 2   JAN          307 non-null    object 
 3   FEB          309 non-null    object 
 4   MAR          309 non-null    object 
 5   APR          309 non-null    object 
 6   MAY          311 non-null    object 
 7   JUN          311 non-null    object 
 8   JUL          317 non-null    object 
 9   AUG          317 non-null    object 
 10  SEP          317 non-null    object 
 11  OCT          317 non-null    object 
 12  NOV          317 non-null    object 
 13  DEC          313 non-null    object 
 14  ANN          307 non-null    object 
 15  Unnamed: 14  0 non-null      float64
dtypes: float64(1), int64(1), object(14)
memory usage: 39.8+ KB


### 2.2 &emsp; Analyzing Data Set 2: "The US Large-Scale Solar Photovoltaic Database (USPVDB)"

In [4]:
df_photovoltaic_original = pd.read_csv(
    './data/uspvdb_v3_0_20250430.csv'
    )

print(df_photovoltaic_original.shape)

(5712, 26)


In [5]:
print(df_photovoltaic_original.iloc[:,:13].head())

   case_id multi_poly  eia_id p_state           p_county       ylat  \
0   406374     single   66887      AK  Matanuska-Susitna  61.587349   
1   405016      multi    6304      AK   Northwest Arctic  66.838470   
2   401476      multi   60058      AL         Lauderdale  34.833809   
3   401865      multi   60679      AL               Dale  31.331732   
4   401866      multi   60680      AL            Calhoun  33.626301   

        xlong   p_area  p_img_date  p_dig_conf                   p_name  \
0 -149.789413   172005    20240814           4            Houston Solar   
1 -162.553146     8740    20240719           4          Kotzebue Hybrid   
2  -87.838394  1735134    20220212           4    River Bend Solar, LLC   
3  -85.729469   187820    20220609           4  Fort Rucker Solar Array   
4  -85.940590    39717    20210814           4         ANAD Solar Array   

   p_year p_pwr_reg  
0    2023        AK  
1    2020       NaN  
2    2016       TVA  
3    2017      SOCO  
4    2017   

In [6]:
print(df_photovoltaic_original.iloc[:,13:26].head())

  p_tech_pri p_tech_sec p_sys_type       p_axis  p_azimuth  p_tilt  p_battery  \
0         PV        NaN     ground   fixed-tilt      180.0    40.0        NaN   
1         PV        NaN     ground  single-axis      156.0    40.0  batteries   
2         PV       c-si     ground  single-axis      270.0    17.0        NaN   
3         PV  thin-film     ground  single-axis      188.0    20.0        NaN   
4         PV  thin-film     ground   fixed-tilt      180.0    20.0        NaN   

   p_cap_ac  p_cap_dc      p_type       p_agrivolt p_comm  p_zscore  
0       6.0       8.4  greenfield             crop    NaN -0.457675  
1       1.7       3.4  greenfield  non-agrivoltaic    NaN  5.617232  
2      75.0     100.2  greenfield  non-agrivoltaic    NaN -0.298527  
3      10.6      12.7  greenfield  non-agrivoltaic    NaN -0.122265  
4       7.4       9.7   superfund  non-agrivoltaic    NaN  3.031619  


In [7]:
df_photovoltaic_original.query('p_cap_ac.isnull() | p_cap_dc.isnull()').shape

(0, 26)

### 2.3 &emsp; Analyzing Data Set 3: "ZHVI 3-Bedroom Time Series($) - City"

In [8]:
df_landvalue_original = pd.read_csv(
    './data/City_zhvi_bdrmcnt_3_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv.zip',
    compression='zip'
    )

df_landvalue_original.columns

Index(['RegionID', 'SizeRank', 'RegionName', 'RegionType', 'StateName',
       'State', 'Metro', 'CountyName', '2000-01-31', '2000-02-29',
       ...
       '2024-08-31', '2024-09-30', '2024-10-31', '2024-11-30', '2024-12-31',
       '2025-01-31', '2025-02-28', '2025-03-31', '2025-04-30', '2025-05-31'],
      dtype='object', length=313)

In [9]:
df_landvalue_original.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15711 entries, 0 to 15710
Columns: 313 entries, RegionID to 2025-05-31
dtypes: float64(305), int64(2), object(6)
memory usage: 37.5+ MB


In [10]:
print(
    df_landvalue_original[[
        'RegionID', 'SizeRank', 'RegionName', 'RegionType', 'StateName',
        'State', 'Metro', 'CountyName', '2025-05-31'
        ]].head()
    )

   RegionID  SizeRank   RegionName RegionType StateName State  \
0      6181         0     New York       city        NY    NY   
1     12447         1  Los Angeles       city        CA    CA   
2     39051         2      Houston       city        TX    TX   
3     17426         3      Chicago       city        IL    IL   
4      6915         4  San Antonio       city        TX    TX   

                                   Metro          CountyName     2025-05-31  
0  New York-Newark-Jersey City, NY-NJ-PA       Queens County  840048.900964  
1     Los Angeles-Long Beach-Anaheim, CA  Los Angeles County  964249.977821  
2   Houston-The Woodlands-Sugar Land, TX       Harris County  253134.060059  
3     Chicago-Naperville-Elgin, IL-IN-WI         Cook County  336756.496352  
4          San Antonio-New Braunfels, TX        Bexar County  235986.092899  


### 2.4 &emsp; Analyzing Data Set 4: "United States Cities Database - Basic"

This data set lists most cities in the United States together with several fields for each city, such as its name, state, county, latitude, longitude, population, and population density.

In [11]:
df_city_original = pd.read_csv(
    './data/uscities.csv'
    )

print(df_city_original.head())
print("")
print("df_city_original.shape ...", df_city_original.shape)

          city   city_ascii state_id  state_name  county_fips  county_name  \
0     New York     New York       NY    New York        36081       Queens   
1  Los Angeles  Los Angeles       CA  California         6037  Los Angeles   
2      Chicago      Chicago       IL    Illinois        17031         Cook   
3        Miami        Miami       FL     Florida        12086   Miami-Dade   
4      Houston      Houston       TX       Texas        48201       Harris   

       lat       lng  population  density source  military  incorporated  \
0  40.6943  -73.9249    18832416  10943.7  shape     False          True   
1  34.1141 -118.4068    11885717   3165.7  shape     False          True   
2  41.8375  -87.6866     8489066   4590.3  shape     False          True   
3  25.7840  -80.2101     6113982   4791.1  shape     False          True   
4  29.7860  -95.3885     6046392   1386.2  shape     False          True   

              timezone  ranking  \
0     America/New_York        1   
1  A

This data set has both a `city` column and a `city_ascii` column. Check where the two differ:

In [12]:
df_city_diff = \
    df_city_original[['city','city_ascii','state_id']]\
    .query('city != city_ascii')

print(
    "The first few rows out of", df_city_diff.shape[0], "rows "
    "whose values in the city field and city_ascii field are different: "
    "\n"
    )

print(df_city_diff.head())

The first few rows out of 76 rows whose values in the city field and city_ascii field are different: 

            city  city_ascii state_id
265      Bayamón     Bayamon       PR
484   San Germán  San German       PR
525     Mayagüez    Mayaguez       PR
752   Juana Díaz  Juana Diaz       PR
2014      Cataño      Catano       PR
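
The relationship between `city` and `city_ascii` in the rows above looks like plain accent stripping. As a hedged illustration only (how the data provider actually derives `city_ascii` is not stated in this report), the standard library can reproduce that folding for the sample names. `to_ascii` is a hypothetical helper.

```python
import unicodedata


def to_ascii(name: str) -> str:
    """Hypothetical helper: decompose accented letters and drop the accent marks."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Names copied from the output above.
samples = ["Bayamón", "San Germán", "Mayagüez", "Juana Díaz", "Cataño"]
folded = [to_ascii(s) for s in samples]

print(folded)
```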


Many city names in Puerto Rico contain Spanish letters. Check all cities in Puerto Rico in the sunshine data set:

In [13]:
print(
    df_sunshine_original.iloc[1:158].query('CITY.str.contains(",PR")')['CITY']
    )

157    SAN JUAN,PR
Name: CITY, dtype: object


The sunshine data set contains only one city in Puerto Rico: San Juan. Check how "San Juan" is recorded in the city data set:

In [14]:
print(
    df_city_original[['city','city_ascii','state_id']]\
        .query('city_ascii == "San Juan"')
    )

          city city_ascii state_id
29    San Juan   San Juan       PR
1298  San Juan   San Juan       TX


There are actually two cities named San Juan: one in Puerto Rico and the other in Texas. Neither name contains a non-ASCII letter, so the record for San Juan, Puerto Rico can be kept as is.

Check all cities containing non-ascii letters outside Puerto Rico:

In [15]:
print(df_city_diff.query('state_id != "PR"'))

                       city            city_ascii state_id
2250   La Cañada Flintridge  La Canada Flintridge       CA
2627             Cañon City            Canon City       CO
3944               Española              Espanola       NM
5238            Piñon Hills           Pinon Hills       CA
11895          César Chávez          Cesar Chavez       TX
14389              Doña Ana              Dona Ana       NM
15643             Cañoncito             Canoncito       NM
17598           Peña Blanca           Pena Blanca       NM
18469               Peñasco               Penasco       NM
19962  Cañada de los Alamos  Canada de los Alamos       NM
20794                 Cañon                 Canon       NM
27154              Salineño              Salineno       TX
28089               Cañones               Canones       NM
28160        Salineño North        Salineno North       TX
29328                Lopeño                Lopeno       TX


These cities are located in California, Colorado, New Mexico, or Texas. Check cities in those states in the sunshine data set:

In [16]:
print(
    df_sunshine_original.iloc[1:158]
    .query('CITY.str.contains(",CA|,CO|,NM|,TX")')['CITY']
    )

10             FRESNO,CA
11        LOS ANGELES,CA
12         SACRAMENTO,CA
13          SAN DIEGO,CA
14      SAN FRANCISCO,CA
15             DENVER,CO
16     GRAND JUNCTION,CO
17             PUEBLO,CO
84        ALBUQUERQUE,NM
85            ROSWELL,NM
123           ABILENE,TX
124          AMARILLO,TX
125            AUSTIN,TX
126       BROWNSVILLE,TX
127    CORPUS CHRISTI,TX
128            DALLAS,TX
129           EL PASO,TX
130           HOUSTON,TX
131           LUBBOCK,TX
132    MIDLAND-ODESSA,TX
133       PORT ARTHUR,TX
134       SAN ANTONIO,TX
Name: CITY, dtype: object


None of these cities in the sunshine data set appears among the city records whose names contain non-ASCII letters. Therefore, you can remove all records containing non-ASCII letters from the city data set without affecting the match against the sunshine data set.

### 2.5 &emsp; Planning the merged DataFrame


A combined DataFrame: `df_solar`

| Column               | Example          | Data Sets                                   |
| :------------------- | :--------------- | :------------------------------------------ |
| County-State         | ALAMEDA,CA       | City, Photovoltaic, Land Value              |
| City-State           | BERKELEY,CA      | City, Sunshine, Land Value                  |
| Longitude            | -149.789413      | City, Photovoltaic                          |
| Latitude             | 61.587349        | City, Photovoltaic                          |
| ANN                  | 58               | Sunshine                                    |
| JAN ... DEC          | 58               | Sunshine                                    |
| Current_log          | 14.4             | Photovoltaic (natural logarithm of AC + DC) |
| Land Value           | 840048.900963529 | Land Value                                  |

Some cities that are listed in both the Land Value data set and the Sunshine data set do not have solar power plants. In later steps, you predict Current_log for those cities and use the predictions to suggest where solar power plants should be built.


Note that in the United States, a city name alone does not identify a city. Some cities share both the city name and the county name but are located in different states, so the state must be part of every key.

| Column | Sunshine | Land Value  | PV      | City    | Example 1  | Example 2  |
| :----- | :------: | :---------: | :-----: | :-----: | :--------- | :--------- |
| City   | &#9679;  | &#9679;     | -       | &#9679; | Franklin   | Franklin   |
| County | -        | &#9679;     | &#9679; | &#9679; | Williamson | Williamson |
| State  | &#9679;  | &#9679;     | &#9679; | &#9679; | Tennessee  | Texas      |
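
The following sketch illustrates the point of the table above with its own example. It assumes a toy DataFrame with hypothetical lowercase column names (`city`, `county`, `state`) and builds keys in the `NAME,ST` form that this report uses.

```python
import pandas as pd

# Toy rows based on the example in the table above.
df = pd.DataFrame({
    "city":   ["Franklin", "Franklin"],
    "county": ["Williamson", "Williamson"],
    "state":  ["TN", "TX"],
})

# Without the state, the two rows are indistinguishable.
dupes_without_state = df.duplicated(subset=["city", "county"], keep=False).sum()

# With the state appended, each key is unique.
df["City-State"] = df["city"].str.upper() + "," + df["state"]
df["County-State"] = df["county"].str.upper() + "," + df["state"]

print(dupes_without_state)
print(df[["City-State", "County-State"]])
```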


### 2.6 &emsp; Converting DataFrames

The original data sets are stored in CSV format. After loading them into pandas DataFrame objects, make sure that every numeric entry is recognized as either an integer or a float data type.

In [17]:
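def convert_df_obj_numeric(df):
    """Convert every object-dtype column of df to a numeric dtype, in place."""
    cols_obj = df.select_dtypes(include='object').columns
    # errors='coerce' replaces values that cannot be parsed with NaN
    df[cols_obj] = df[cols_obj].apply(pd.to_numeric, errors='coerce')
    return df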
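
To illustrate what `errors='coerce'` does in the conversion above, here is a small sketch on a toy DataFrame (the values are illustrative). It shows that parseable strings become numbers while anything else becomes `NaN`.

```python
import pandas as pd

# Toy frame: numbers stored as strings, plus one stray non-numeric token per column.
df = pd.DataFrame({
    "JAN": ["46", "47", "JAN"],
    "FEB": ["53", "55", "FEB"],
})

# errors='coerce' turns anything unparseable into NaN instead of raising.
converted = df.apply(pd.to_numeric, errors="coerce")

print(converted.dtypes)
print(converted)
```

One consequence worth keeping in mind: a genuine text column such as a city name would be coerced to all `NaN`, so this kind of conversion is best applied only to columns that are meant to be numeric.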

## 3 &emsp; Extracting Data Sets

### 3.1 &emsp; Data Set 1: Sunshine

In [18]:
df_sunshine_original = pd.read_csv(
    './data/Average Percent of Possible Sunshine by US City.csv'
    )

Check the `CITY` column:

In [19]:
print(df_sunshine_original[['CITY']].value_counts())

CITY                  
ABERDEEN,SD               2
PENSACOLA,FL              2
NOME,AK                   2
NORFOLK,VA                2
NORTH PLATTE,NE           2
                         ..
GRAND RAPIDS,MI           2
GREAT FALLS,MT            2
GREEN BAY,WI              2
YAP- W CAROLINE IS.,PC    2
CITY                      1
Name: count, Length: 159, dtype: int64


There is one invalid entry, `"CITY"`, and every other city has exactly two entries. Check the row where the `CITY` column is `CITY`:

In [20]:
print(df_sunshine_original[df_sunshine_original['CITY'] == 'CITY'])

     index  CITY  JAN  FEB  MAR  APR  MAY  JUN  JUL  AUG  SEP  OCT  NOV  DEC  \
158    158  CITY  JAN  FEB  MAR  APR  MAY  JUN  JUL  AUG  SEP  OCT  NOV  DEC   

     ANN  Unnamed: 14  
158  ANN          NaN  


In [21]:
print(df_sunshine_original.sort_values(by=['CITY','index']).head(10))

     index            CITY  JAN  FEB  MAR  APR  MAY  JUN JUL AUG SEP OCT NOV  \
115    115     ABERDEEN,SD  NaN   54   58   63   65   66  74  78  68  48  21   
315    315     ABERDEEN,SD  NaN   54   58   63   65   66  74  78  68  48  21   
123    123      ABILENE,TX   63   66   70   71   71   77  80  75  69  68  64   
182    182      ABILENE,TX   63   66   70   71   71   77  80  75  69  68  64   
86      86       ALBANY,NY   46   52   51   55   53   55  62  58  54  46  33   
287    287       ALBANY,NY   46   52   51   55   53   55  62  58  54  46  33   
84      84  ALBUQUERQUE,NM   73   73   73   78   80   82  76  76  77  80  75   
169    169  ALBUQUERQUE,NM   73   73   73   78   80   82  76  76  77  80  75   
108    108    ALLENTOWN,PA  NaN  NaN  NaN  NaN  NaN  NaN  90  93  82  52  47   
314    314    ALLENTOWN,PA  NaN  NaN  NaN  NaN  NaN  NaN  90  93  82  52  47   

     DEC  ANN  Unnamed: 14  
115  NaN  NaN          NaN  
315  NaN  NaN          NaN  
123   65   69          NaN  
182

In this data set, the row where `'index'` is 158 is a repeated header row rather than data, so you can remove it.

Look at the `"index"` field of each pair of rows for the same city: one row has an index below 158 and the other has an index above 158. All other values, such as `"ANN"`, are identical within each pair.

This suggests that the data set contains two copies of the same table with the rows in different orders. In the DataFrame, the first copy occupies index 0 to 157, row 158 is the repeated header, and the second copy occupies index 159 to 316. You only need the first copy.
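
The assumption about two copies of the same table can be checked rather than taken on faith. The sketch below shows one way to do so on a toy stand-in for the sunshine file. The data and the variable names are hypothetical.

```python
import pandas as pd

# Toy stand-in for the sunshine file: two copies of the same rows in a
# different order, separated by a repeated header row.
raw = pd.DataFrame({
    "CITY": ["A,AL", "B,AK", "CITY", "B,AK", "A,AL"],
    "ANN":  ["58", "43", "ANN", "43", "58"],
})

# Locate the repeated header row and split around it.
header_pos = raw.index[raw["CITY"] == "CITY"][0]
first = raw.iloc[:header_pos]
second = raw.iloc[header_pos + 1:]

# Sort both halves by city so that row order does not matter, then compare.
first_sorted = first.sort_values("CITY").reset_index(drop=True)
second_sorted = second.sort_values("CITY").reset_index(drop=True)
same = first_sorted.equals(second_sorted)

print(header_pos, same)
```

On the real data, the `index` column would need to be dropped before the comparison because it differs between the two copies.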

Construct a new DataFrame that is based on the original data set for the sunshine information with the following changes:

* Keep only the rows of the first table, that is, the rows before the repeated header at index 158.
* Rename the `CITY` column to `City-State`.
* Use the `City-State` column as the index instead of the `index` column.
* Drop the unnecessary columns: `index` and `"Unnamed: 14"`.

In [22]:
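# Note: iloc[1:158] selects rows 1 to 157 and leaves out row 0 (BIRMINGHAM,AL);
# iloc[:158] would keep it. The outputs below reflect iloc[1:158].
df_sunshine = df_sunshine_original.iloc[1:158]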

df_sunshine.rename(columns={'CITY': 'City-State'}, inplace=True)

df_sunshine.set_index(['City-State'], inplace=True)

del df_sunshine['index']
del df_sunshine['Unnamed: 14']

Check rows that include a null value in any field:

In [23]:
print(df_sunshine[df_sunshine.isna().any(axis=1)])

               JAN  FEB  MAR  APR  MAY  JUN JUL AUG SEP OCT NOV  DEC  ANN
City-State                                                               
TUPELO,MS      NaN  NaN  NaN  NaN  NaN  NaN  66  59  61  67  57   45  NaN
CINCINNATI,OH  NaN  NaN  NaN  NaN   47   70  85  76  77  50  44   30  NaN
ALLENTOWN,PA   NaN  NaN  NaN  NaN  NaN  NaN  90  93  82  52  47  NaN  NaN
ABERDEEN,SD    NaN   54   58   63   65   66  74  78  68  48  21  NaN  NaN
ELKINS,WV      NaN  NaN  NaN  NaN  NaN  NaN  69  50  52  43  31   18  NaN


There are 5 such rows, which constitute ~3.2% of the whole DataFrame. Because appropriate values to fill in the null fields would be almost impossible to find, you should discard these 5 records.

In [24]:
df_sunshine.dropna(inplace=True)
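
The share quoted above follows from 5 incomplete rows out of 157 (5 / 157 ≈ 3.2%). The sketch below shows how such a share could be computed before dropping rows. It uses a toy DataFrame with made-up values.

```python
import numpy as np
import pandas as pd

# Toy frame with one incomplete row out of four.
df = pd.DataFrame(
    {"JAN": [46, np.nan, 43, 39], "FEB": [53, 55, 46, 35]},
    index=["A,AL", "B,MS", "C,AK", "D,AK"],
)

# Rows with at least one missing value, and their share of the frame.
incomplete = df.isna().any(axis=1)
share = incomplete.mean()

cleaned = df.dropna()
print(f"{incomplete.sum()} of {len(df)} rows ({share:.1%}) have a null value")
```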

In [25]:
del df_sunshine['ANN']

Convert entries in the numeric columns to numeric data type:

In [26]:
cols_numeric = [
    'JAN','FEB','MAR','APR','MAY','JUN','JUL','AUG','SEP','OCT','NOV','DEC'
    ]

df_sunshine[cols_numeric] = df_sunshine[cols_numeric]\
    .apply(pd.to_numeric, errors='coerce')

print(df_sunshine.head())

               JAN  FEB  MAR  APR  MAY  JUN  JUL  AUG  SEP  OCT  NOV  DEC
City-State                                                               
MONTGOMERY,AL   47   55   58   64   63   64   61   61   59   63   55   49
ANCHORAGE,AK    43   46   51   50   51   46   43   43   41   36   35   33
JUNEAU,AK       39   35   38   42   44   37   33   35   27   21   26   21
NOME,AK         38   56   54   52   52   43   39   34   38   35   30   36
FLAGSTAFF,AZ    71   73   72   82   83   88   74   75   79   77   72   76


In [27]:
print(df_sunshine.describe())

              JAN         FEB         MAR         APR         MAY         JUN  \
count  152.000000  152.000000  152.000000  152.000000  152.000000  152.000000   
mean    51.039474   56.013158   59.164474   61.625000   62.921053   66.302632   
std     11.677159    9.966159   10.116830   10.474701   10.261822   11.132578   
min     20.000000   28.000000   31.000000   36.000000   37.000000   31.000000   
25%     43.000000   50.000000   52.000000   55.000000   58.000000   61.000000   
50%     51.000000   56.000000   59.000000   59.000000   61.000000   65.500000   
75%     58.000000   62.000000   65.000000   67.000000   66.250000   72.000000   
max     80.000000   83.000000   87.000000   92.000000   94.000000   95.000000   

              JUL         AUG         SEP         OCT         NOV         DEC  
count  152.000000  152.000000  152.000000  152.000000  152.000000  152.000000  
mean    68.848684   67.315789   64.217105   60.184211   50.105263   47.309211  
std     11.375447   11.030614 

All the maximum and minimum values look reasonable: every value lies between 0 and 100, as expected for a percentage of possible sunshine. For example, one city received only 16% of possible sunshine in December, whereas another received 97% in July. On average, the values are also higher in summer than in winter (a mean of about 69 in July versus about 47 in December).

Take a look at the cities with the highest value in July and the lowest value in December:

In [28]:
print(df_sunshine.query('JUL == 97.0 | DEC == 16.0')[['JUL','DEC']])

               JUL  DEC
City-State             
SACRAMENTO,CA   97   47
QUILLAYUTE,WA   42   16


In [29]:
del df_sunshine_original

### 3.2 &emsp; Data Set 2: Photovoltaic

In [30]:
df_photovoltaic = df_photovoltaic_original[[
    'case_id', 'p_county', 'p_state', 'p_cap_ac', 'p_cap_dc'
    ]].set_index('case_id')

df_photovoltaic.rename(
    columns={
        'p_cap_ac': 'AC',
        'p_cap_dc': 'DC'
        },
    inplace=True
    )

df_photovoltaic['County-State'] = df_photovoltaic['p_county'].str.upper() \
    + ',' +  df_photovoltaic['p_state']

del df_photovoltaic['p_county']
del df_photovoltaic['p_state']
# del df_photovoltaic_original

print(df_photovoltaic.head())

           AC     DC          County-State
case_id                                   
406374    6.0    8.4  MATANUSKA-SUSITNA,AK
405016    1.7    3.4   NORTHWEST ARCTIC,AK
401476   75.0  100.2         LAUDERDALE,AL
401865   10.6   12.7               DALE,AL
401866    7.4    9.7            CALHOUN,AL


In [31]:
df_photovoltaic = df_photovoltaic.groupby('County-State').sum()
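
The grouping step above collapses facility-level rows into one row per county. A minimal sketch with made-up numbers and hypothetical county keys shows the effect:

```python
import pandas as pd

# Toy facility-level rows; two facilities share a county (values are made up).
facilities = pd.DataFrame({
    "AC": [5.0, 2.0, 10.0],
    "DC": [6.0, 3.0, 12.0],
    "County-State": ["ALPHA,ZZ", "BETA,ZZ", "ALPHA,ZZ"],
})

# One row per county, with AC and DC summed over its facilities.
per_county = facilities.groupby("County-State").sum()

print(per_county)
```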

In [32]:
df_photovoltaic['Current'] = df_photovoltaic['AC'] + df_photovoltaic['DC']
df_photovoltaic['Current_log'] = np.log(
    df_photovoltaic['AC'] + df_photovoltaic['DC']
)

print(df_photovoltaic.describe())

                AC           DC      Current  Current_log
count  1100.000000  1100.000000  1100.000000  1100.000000
mean     90.132727   112.664182   202.796909     3.840443
std     257.494335   316.062268   573.237206     1.775106
min       0.300000     1.000000     1.800000     0.587787
25%       4.500000     5.700000    10.175000     2.319925
50%      20.000000    26.900000    47.250000     3.855452
75%      80.000000   104.075000   185.250000     5.221697
max    3690.600000  4498.200000  8180.200000     9.009472


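The AC capacity ("AC") and the DC capacity ("DC") follow such similar distributions that their sum should be good enough to assess and predict the solar power capacity in every city. This report names the sum "Current", although it holds capacity rather than electrical current. Only the natural logarithm of the sum ("Current_log") is used later, so you can remove AC, DC, and Current from this DataFrame.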

In [33]:
del df_photovoltaic['AC']
del df_photovoltaic['DC']
del df_photovoltaic['Current']
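
The `describe()` output above shows why a logarithm is a reasonable choice here: the mean of `Current` (about 203) is far above its median (about 47), which indicates a long right tail. The sketch below, using illustrative values only, shows how the natural logarithm compresses such a tail and how `np.exp` maps values back to the original scale.

```python
import numpy as np
import pandas as pd

# Toy totals spanning several orders of magnitude (illustrative values).
total = pd.Series([2.0, 10.0, 50.0, 200.0, 8000.0], name="Current")

# The natural logarithm compresses the long right tail.
total_log = np.log(total)

print(total.mean(), total.median())
print(total_log.round(2).tolist())

# np.exp inverts the transform, for example to report a prediction
# on the original scale.
recovered = np.exp(total_log)
```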

### 3.3 &emsp; Data Set 3: Land Values

The original data set has many columns of historical average home values for 3-bedroom houses, but you only need the latest one: `"2025-05-31"`.

Look at the first few rows after dropping all the other historical columns:

In [34]:
df_landvalue = df_landvalue_original.copy()[[
    'RegionID', 'State', 'RegionName', 'CountyName', '2025-05-31'
    ]].set_index('RegionID')

df_landvalue.rename(columns={'2025-05-31': 'Land Value'}, inplace=True)

df_landvalue['County-State']\
    = df_landvalue['CountyName']\
        .str.replace(r'\s* County$', '', regex=True)\
        .str.upper() + ',' + df_landvalue['State']
del df_landvalue['CountyName']

df_landvalue['City-State'] \
    = df_landvalue['RegionName'].str.upper() + ',' + df_landvalue['State']
del df_landvalue['RegionName']

del df_landvalue['State']

print(df_landvalue.head())
print()
print(df_landvalue.shape[0], "rows")

             Land Value    County-State      City-State
RegionID                                               
6181      840048.900964       QUEENS,NY     NEW YORK,NY
12447     964249.977821  LOS ANGELES,CA  LOS ANGELES,CA
39051     253134.060059       HARRIS,TX      HOUSTON,TX
17426     336756.496352         COOK,IL      CHICAGO,IL
6915      235986.092899        BEXAR,TX  SAN ANTONIO,TX

15711 rows
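
The code above hard-codes the latest column name, `'2025-05-31'`. If the file is refreshed, the latest column could instead be found programmatically. The following is a sketch under the assumption that every date column label follows the `YYYY-MM-DD` pattern seen above. The toy values and the name `latest_col` are hypothetical.

```python
import pandas as pd

# Toy frame shaped like the land value data: id columns plus month-end columns.
df = pd.DataFrame({
    "RegionID": [1, 2],
    "RegionName": ["X", "Y"],
    "2025-03-31": [100.0, 200.0],
    "2025-04-30": [101.0, 201.0],
    "2025-05-31": [102.0, 202.0],
})

# Parse the column labels as dates; labels that do not match become NaT.
labels = pd.Series(df.columns, index=df.columns)
parsed = pd.to_datetime(labels, format="%Y-%m-%d", errors="coerce")

# The label of the most recent date column.
latest_col = parsed.dropna().idxmax()

print(latest_col)
```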


In [35]:
del df_landvalue_original

### 3.4 &emsp; Loading and optimizing the city data set

In [36]:
df_city = df_city_original.copy()\
    [['city','state_id','county_name','lat','lng','population','density']]

print(df_city.head())

          city state_id  county_name      lat       lng  population  density
0     New York       NY       Queens  40.6943  -73.9249    18832416  10943.7
1  Los Angeles       CA  Los Angeles  34.1141 -118.4068    11885717   3165.7
2      Chicago       IL         Cook  41.8375  -87.6866     8489066   4590.3
3        Miami       FL   Miami-Dade  25.7840  -80.2101     6113982   4791.1
4      Houston       TX       Harris  29.7860  -95.3885     6046392   1386.2


In [37]:
df_city['City-State'] \
    = df_city['city'].str.upper()        + ',' + df_city['state_id']
df_city['County-State'] \
    = df_city['county_name'].str.upper() + ',' + df_city['state_id']

del df_city['city']
del df_city['county_name']
del df_city['state_id']

In [38]:
print(df_city['City-State'].value_counts())

City-State
OAKWOOD,OH           3
SAN ANTONIO,PR       3
OAKLAND,PA           3
GEORGETOWN,PA        3
MIDWAY,FL            3
                    ..
BOUTTE,LA            1
BEDFORD HILLS,NY     1
BOWLING GREEN,FL     1
PIRU,CA              1
FALCON VILLAGE,TX    1
Name: count, Length: 31183, dtype: int64


In [39]:
print(df_city.query('`City-State` == "OAKWOOD,OH"'))

           lat      lng  population  density  City-State   County-State
4258   39.7202 -84.1734        9480   1667.9  OAKWOOD,OH  MONTGOMERY,OH
8100   41.3669 -81.5036        3526    394.7  OAKWOOD,OH    CUYAHOGA,OH
20508  41.0927 -84.3747         443    243.5  OAKWOOD,OH    PAULDING,OH


In [40]:
df_city = df_city\
    .sort_values('population', ascending=False)\
    .drop_duplicates('City-State')

print(df_city.query('`City-State` == "OAKWOOD,OH"'))

          lat      lng  population  density  City-State   County-State
4258  39.7202 -84.1734        9480   1667.9  OAKWOOD,OH  MONTGOMERY,OH
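
The deduplication above keeps, for each `City-State` key, the row with the largest population. The sketch below repeats that idea on toy rows based on the `OAKWOOD,OH` output and shows an equivalent selection with `idxmax`. The choice to keep the most populous place is the report's simplification: the other places with the same key are dropped.

```python
import pandas as pd

# Toy rows based on the OAKWOOD,OH output above.
df = pd.DataFrame({
    "City-State": ["OAKWOOD,OH", "OAKWOOD,OH", "OAKWOOD,OH", "MIAMI,FL"],
    "County-State": ["PAULDING,OH", "MONTGOMERY,OH", "CUYAHOGA,OH", "MIAMI-DADE,FL"],
    "population": [443, 9480, 3526, 6113982],
})

# Sort so the most populous row of each key comes first, then keep that row.
deduped = (
    df.sort_values("population", ascending=False)
      .drop_duplicates("City-State")
)

# Equivalent selection: the row label of the maximum population per key.
via_idxmax = df.loc[df.groupby("City-State")["population"].idxmax()]

print(deduped)
```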


In [41]:
print(df_city['City-State'].value_counts())

City-State
NEW YORK,NY          1
SKELLYTOWN,TX        1
PETER,UT             1
MAMMOTH,PA           1
ELMORA,PA            1
                    ..
ESTILL,SC            1
SAND HILL,PA         1
MAUNAWILI,HI         1
HANAPEPE,HI          1
FALCON VILLAGE,TX    1
Name: count, Length: 31183, dtype: int64


Now you can use the `City-State` column for the index of the city DataFrame.

In [42]:
df_city.set_index(['City-State'], inplace=True)

In [43]:
print(df_city.head())

                    lat       lng  population  density    County-State
City-State                                                            
NEW YORK,NY     40.6943  -73.9249    18832416  10943.7       QUEENS,NY
LOS ANGELES,CA  34.1141 -118.4068    11885717   3165.7  LOS ANGELES,CA
CHICAGO,IL      41.8375  -87.6866     8489066   4590.3         COOK,IL
MIAMI,FL        25.7840  -80.2101     6113982   4791.1   MIAMI-DADE,FL
HOUSTON,TX      29.7860  -95.3885     6046392   1386.2       HARRIS,TX


## 4 &emsp; Combining Four DataFrames into One

In [44]:
df_solar = df_landvalue.copy()

df_solar = pd.merge(df_solar, df_photovoltaic, on='County-State', how='outer')
df_solar.drop(['County-State'], axis=1, inplace=True)

df_solar = pd.merge(df_solar, df_sunshine,     on='City-State',   how='inner')

df_solar = pd.merge(df_solar, df_city,         on='City-State',   how='inner')
df_solar.drop(['County-State'], axis=1, inplace=True)

df_solar.set_index('City-State', inplace=True)

print(df_solar.head())

print(df_solar.shape)

                  Land Value  Current_log  JAN  FEB  MAR  APR  MAY  JUN  JUL  \
City-State                                                                     
BOISE,ID       474416.339219     4.945919   32   49   66   68   74   76   85   
ALBANY,NY      314400.663535     4.729156   46   52   51   55   53   55   62   
PITTSBURGH,PA  237749.893040     1.887070   28   37   42   46   48   52   54   
FORT WAYNE,IN  227793.615240          NaN   50   55   57   63   69   74   76   
ALPENA,MI      191528.977768          NaN   36   43   51   55   59   62   66   

               AUG  SEP  OCT  NOV  DEC      lat       lng  population  density  
City-State                                                                      
BOISE,ID        82   80   69   41   34  43.6005 -116.2308      449428   1068.7  
ALBANY,NY       58   54   46   33   36  42.6664  -73.7987      602242   1805.4  
PITTSBURGH,PA   51   52   46   31   23  40.4397  -79.9763     1712828   2117.1  
FORT WAYNE,IN   75   67   60   40 

In [45]:
print("df_solar         ...", df_solar.shape[0])
print("Current_log is not null ...", df_solar.query('Current_log.notnull()').shape[0])
print("Current_log is     null ...", df_solar.query('Current_log.isnull()' ).shape[0])

df_solar         ... 133
Current_log is not null ... 84
Current_log is     null ... 49
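
The row counts above depend on how each merge matches keys. One way to audit a merge is the `indicator` and `validate` parameters of `pd.merge`. The sketch below uses toy frames with hypothetical keys and values, not the project data.

```python
import pandas as pd

# Toy city-level and county-level frames (hypothetical keys and values).
cities = pd.DataFrame({
    "City-State": ["A,ZZ", "B,ZZ", "C,ZZ"],
    "County-State": ["ALPHA,ZZ", "ALPHA,ZZ", "BETA,ZZ"],
})
counties = pd.DataFrame({
    "County-State": ["ALPHA,ZZ", "GAMMA,ZZ"],
    "Current_log": [3.2, 1.1],
})

# indicator=True adds a _merge column saying where each row came from;
# validate raises an error if the right-hand keys are not unique.
merged = pd.merge(
    cities, counties,
    on="County-State", how="outer",
    indicator=True, validate="many_to_one",
)
counts = merged["_merge"].value_counts()

print(merged)
print(counts)
```

In this toy example both cities in the same county receive the same county-level `Current_log`, which mirrors how a county-level value is attached to each city when merging on `County-State`.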


In [46]:
df_solar.info()

<class 'pandas.core.frame.DataFrame'>
Index: 133 entries, BOISE,ID to BILLINGS,MT
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Land Value   133 non-null    float64
 1   Current_log  84 non-null     float64
 2   JAN          133 non-null    int64  
 3   FEB          133 non-null    int64  
 4   MAR          133 non-null    int64  
 5   APR          133 non-null    int64  
 6   MAY          133 non-null    int64  
 7   JUN          133 non-null    int64  
 8   JUL          133 non-null    int64  
 9   AUG          133 non-null    int64  
 10  SEP          133 non-null    int64  
 11  OCT          133 non-null    int64  
 12  NOV          133 non-null    int64  
 13  DEC          133 non-null    int64  
 14  lat          133 non-null    float64
 15  lng          133 non-null    float64
 16  population   133 non-null    int64  
 17  density      133 non-null    float64
dtypes: float64(5), int64(13)
memory usage: 2

## 5 &emsp; Splitting the DataFrame for Training and Testing

In [47]:
X_train, X_test, y_train, y_test = train_test_split(
    df_solar.query('Current_log.notnull()').drop(['Current_log'], axis=1),
    df_solar.query('Current_log.notnull()')['Current_log'],
    random_state=42
    )

X_predict \
    = df_solar.query('Current_log.isnull()' ).drop(['Current_log'], axis=1)
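
Because the call above does not set `test_size`, scikit-learn's default of 0.25 applies. The sketch below reproduces the same three-way partition on a synthetic DataFrame with the same row counts (84 labeled, 49 unlabeled) to show the resulting sizes. The feature column and all values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic frame with the same row counts as above: 84 labeled, 49 unlabeled.
rng = np.random.default_rng(0)
df = pd.DataFrame({"feature": rng.normal(size=133)})
df["Current_log"] = np.nan
df.loc[df.index[:84], "Current_log"] = rng.normal(size=84)

# Labeled rows are split into training and test sets.
labeled = df[df["Current_log"].notna()]
X_train, X_test, y_train, y_test = train_test_split(
    labeled.drop(columns="Current_log"),
    labeled["Current_log"],
    random_state=42,  # test_size is left at the default of 0.25
)

# Unlabeled rows are set aside for prediction.
X_predict = df[df["Current_log"].isna()].drop(columns="Current_log")

print(len(X_train), len(X_test), len(X_predict))
```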

In [48]:
print(X_predict.head())

                  Land Value  JAN  FEB  MAR  APR  MAY  JUN  JUL  AUG  SEP  \
City-State                                                                  
FORT WAYNE,IN  227793.615240   50   55   57   63   69   74   76   75   67   
ALPENA,MI      191528.977768   36   43   51   55   59   62   66   60   52   
ANCHORAGE,AK   392428.040683   43   46   51   50   51   46   43   43   41   
BALTIMORE,MD   206499.969440   50   58   55   57   55   60   63   61   57   
POCATELLO,ID   317148.241331   38   55   63   64   68   74   82   80   78   

               OCT  NOV  DEC      lat       lng  population  density  
City-State                                                            
FORT WAYNE,IN   60   40   36  41.0888  -85.1436      345279    919.8  
ALPENA,MI       41   28   25  45.0740  -83.4402       10178    480.9  
ANCHORAGE,AK    36   35   33  61.1508 -149.1091      289069     65.4  
BALTIMORE,MD    56   50   47  39.3051  -76.6144     2189589   2753.1  
POCATELLO,ID    68   43   36  42.8

## 6 &emsp; Predicting Current by Using Regression Models

In [49]:
from sklearn.metrics import mean_squared_error

from sklearn.pipeline      import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model  import LinearRegression, Ridge
from sklearn.neighbors     import KNeighborsRegressor
from sklearn.tree          import DecisionTreeRegressor
from sklearn.svm           import SVR

from sklearn.ensemble import VotingRegressor

from sklearn.inspection import permutation_importance

## Try-it 20_1

In [50]:
reg_linear = LinearRegression()
reg_knn    = KNeighborsRegressor()
reg_tree   = DecisionTreeRegressor(random_state=42)
reg_ridge  = Ridge()

# Create pipelines for each regressor
pipelines = {
    'LinearRegression()':      Pipeline([
        ('scaler',    StandardScaler()),
        ('regressor', reg_linear)
        ]),
    'KNeighborsRegressor()':   Pipeline([
        ('scaler',    StandardScaler()),
        ('regressor', reg_knn)
        ]),
    'DecisionTreeRegressor()': Pipeline([
        ('regressor', reg_tree)
        ]),
    'Ridge()':                 Pipeline([
        ('scaler',    StandardScaler()),
        ('regressor', reg_ridge)
        ]),
    'SVR()':                   Pipeline([
        ('scaler',    StandardScaler()),
        ('regressor', SVR())
        ])
}

# Define the Voting Regressor
voting_reg = VotingRegressor(estimators=[
    ('LinearRegression()',      pipelines['LinearRegression()']),
    ('KNeighborsRegressor()',   pipelines['KNeighborsRegressor()']),
    ('DecisionTreeRegressor()', pipelines['DecisionTreeRegressor()']),
    ('Ridge()',                 pipelines['Ridge()']),
    ('SVR()',                   pipelines['SVR()'])
])

In [51]:
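def evaluate_model(model, X_test, y_test, caps = False):
    """Return the test-set MSE, optionally clipping predictions first."""
    y_pred = model.predict(X_test)
    
    if caps:
        # Predictions are on the log scale, so this floor is a lower bound
        # on Current_log rather than a guard against negative current.
        y_min = 1.0
        y_pred[y_pred < y_min] = y_min

        y_max = df_solar['Current_log'].max()
        y_pred[y_pred > y_max] = y_max

    mse = mean_squared_error(y_test, y_pred)
    return mse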
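
As an aside, the capping step in `evaluate_model` could also be written with `np.clip`, which returns a new array instead of modifying `y_pred` in place. The sketch below is only illustrative: the array values are made up, and `y_max = 6.5` is a placeholder for `df_solar['Current_log'].max()`.

```python
import numpy as np

# Made-up predictions on the log scale.
y_pred_demo = np.array([-0.5, 0.8, 3.2, 7.9])

y_min = 1.0   # same floor as in evaluate_model
y_max = 6.5   # placeholder for df_solar['Current_log'].max()

# Values below y_min are raised to y_min; values above y_max are lowered to y_max.
capped = np.clip(y_pred_demo, y_min, y_max)
print(capped)
```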

Cross-Validation (CV) Grid Search

In [52]:
# Define parameter grids for Grid Search
param_grids = {
    'LinearRegression()':      {},
    'KNeighborsRegressor()':   {'regressor__n_neighbors': [3, 5, 7]},
    'DecisionTreeRegressor()': {'regressor__max_depth':   [3, 5, 7]},
    'Ridge()':                 {'regressor__alpha':       [0.1, 1.0, 10.0]},
    'SVR()':                   {
                                'regressor__C':           [0.1, 1.0, 10.0],
                                'regressor__gamma':       ['scale', 'auto']
                               }
}

# Perform Grid Search and evaluate models
results_after_gs = {}
best_pipelines   = {}

for name, pipeline in pipelines.items():
    grid_search = GridSearchCV(
        pipeline, param_grids[name], cv=2, scoring='neg_mean_squared_error'
        )
    grid_search.fit(X_train, y_train)
    best_pipeline = grid_search.best_estimator_
    best_pipelines[name] = best_pipeline
    mse = evaluate_model(best_pipeline, X_test, y_test)
    results_after_gs[name] = mse

# Define the optimized Voting Regressor
optimized_voting_reg = VotingRegressor(estimators=[
    ('LinearRegression()',      best_pipelines['LinearRegression()']),
    ('KNeighborsRegressor()',   best_pipelines['KNeighborsRegressor()']),
    ('DecisionTreeRegressor()', best_pipelines['DecisionTreeRegressor()']),
    ('Ridge()',                 best_pipelines['Ridge()']),
    ('SVR()',                   best_pipelines['SVR()'])
])

# Fit the optimized Voting Regressor
optimized_voting_reg.fit(X_train, y_train)
voting_mse_after_gs = evaluate_model(optimized_voting_reg, X_test, y_test)
results_after_gs['VotingRegressor()'] = voting_mse_after_gs

# Print results after grid search
print("Results After Grid Search:")
for name, mse in results_after_gs.items():
    print("MSE: ", name, "...", round(mse,4))

Results After Grid Search:
MSE:  LinearRegression() ... 4.1537
MSE:  KNeighborsRegressor() ... 3.0855
MSE:  DecisionTreeRegressor() ... 4.432
MSE:  Ridge() ... 3.991
MSE:  SVR() ... 3.6409
MSE:  VotingRegressor() ... 3.2892
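
These MSE values are easier to judge against a naive baseline that always predicts the training mean. The sketch below shows one way to compute such a baseline with `DummyRegressor`. It runs on synthetic arrays (the `*_demo` names and their sizes are assumptions chosen to mimic a 63/21 split), so its numbers say nothing about the real data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-ins for X_train, y_train, X_test, y_test.
rng = np.random.default_rng(42)
X_train_demo = rng.normal(size=(63, 3))
y_train_demo = rng.normal(loc=3.0, scale=2.0, size=63)
X_test_demo  = rng.normal(size=(21, 3))
y_test_demo  = rng.normal(loc=3.0, scale=2.0, size=21)

# The baseline ignores the features and predicts the training mean.
baseline = DummyRegressor(strategy="mean").fit(X_train_demo, y_train_demo)
baseline_mse = mean_squared_error(y_test_demo, baseline.predict(X_test_demo))

# The same quantity computed by hand, as a sanity check.
manual_mse = np.mean((y_test_demo - y_train_demo.mean()) ** 2)

print(round(baseline_mse, 4), round(manual_mse, 4))
```

In the notebook, swapping the demo arrays for `X_train`, `y_train`, `X_test`, and `y_test` would give the MSE that each model above needs to beat to be useful.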


In [53]:
scaled_pipe = Pipeline([
    ('scaler',    StandardScaler()),
    ('regressor', KNeighborsRegressor(n_neighbors=5))
]).fit(X_train,y_train)

y_pred = scaled_pipe.predict(X_test)

print(
    "MSE: KNeighborsRegressor(n_neighbors=5) with Standard Scaler ...",
    round(mean_squared_error(y_test, y_pred),4)
    )

print()

print(
    "Actual Current_log vs. Predicted Current_log:",
    "KNeighborsRegressor(n_neighbors=5)\n\n",
    pd.concat(
        [
            X_test.reset_index()['City-State'],
            y_test.reset_index(drop=True),
            pd.DataFrame(y_pred, columns=['Current-Pred']).reset_index(drop=True)
        ],
        axis=1
    ).head(10)
)

MSE: KNeighborsRegressor(n_neighbors=5) with Standard Scaler ... 2.9292

Actual Current_log vs. Predicted Current_log: KNeighborsRegressor(n_neighbors=5)

           City-State  Current_log  Current-Pred
0         WICHITA,KS     0.875469      2.803656
1           BOISE,ID     4.945919      3.333655
2        PORTLAND,OR     1.223775      3.915064
3       NASHVILLE,TN     1.223775      2.771918
4     GREAT FALLS,MT     1.960095      3.532627
5  SALT LAKE CITY,UT     1.064711      3.773876
6     BROWNSVILLE,TX     6.042633      4.196805
7        PORTLAND,ME     4.464758      4.716337
8     ALBUQUERQUE,NM     4.642466      4.465365
9      PROVIDENCE,RI     5.725544      4.075468
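
Because the target is on the log scale, a difference between actual and predicted `Current_log` corresponds to a ratio in the original units. This assumes, as the later prediction cell does with `np.exp(y_pred)`, that Current is recovered by exponentiating `Current_log`. The sketch below works through the first row (WICHITA,KS), where the prediction is off by roughly a factor of 7, and converts the reported MSE into a typical multiplicative error of between 5 and 6.

```python
import numpy as np

# Values copied from the first row of the table above (WICHITA,KS).
actual_log, pred_log = 0.875469, 2.803656

# A difference in log space is a ratio in the original units.
row_ratio = np.exp(pred_log - actual_log)

# Test-set MSE reported above for KNeighborsRegressor(n_neighbors=5).
mse_log = 2.9292

# RMSE in log space, converted to a multiplicative factor.
typical_ratio = np.exp(np.sqrt(mse_log))

print(round(row_ratio, 2), round(typical_ratio, 2))
```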


In [54]:
y_pred = optimized_voting_reg.predict(X_test)

print(
    "MSE: VotingRegressor() after grid search ...",
    round(mean_squared_error(y_test, y_pred),4)
    )

print()

print(
    "Actual Current_log vs. Predicted Current_log: VotingRegressor()",
    "\n\n",
    pd.concat(
        [
            X_test.reset_index()['City-State'],
            y_test.reset_index(drop=True),
            pd.DataFrame(y_pred, columns=['Current_predict']).reset_index(drop=True)
        ],
        axis=1
    ).head(10)
)

MSE: VotingRegressor() after grid search ... 3.2892

Actual Current_log vs. Predicted Current_log: VotingRegressor() 

           City-State  Current_log  Current_predict
0         WICHITA,KS     0.875469         3.533957
1           BOISE,ID     4.945919         4.861531
2        PORTLAND,OR     1.223775         4.447136
3       NASHVILLE,TN     1.223775         2.948035
4     GREAT FALLS,MT     1.960095         4.288323
5  SALT LAKE CITY,UT     1.064711         4.526865
6     BROWNSVILLE,TX     6.042633         4.595649
7        PORTLAND,ME     4.464758         3.832136
8     ALBUQUERQUE,NM     4.642466         5.580022
9      PROVIDENCE,RI     5.725544         3.863643


---
Feature importance and coefficient analysis

In [55]:
print("Permutation Importance from the grid-searched KNeighborsRegressor() pipeline:")
result_knn = permutation_importance(
    best_pipelines['KNeighborsRegressor()'], 
    X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1
    )

perm_importances_knn = pd.DataFrame({
    'feature':    X_test.columns,
    'importance': result_knn.importances_mean
    })

perm_importances_knn = perm_importances_knn\
    .sort_values(by='importance', ascending=False)

print(perm_importances_knn)

Permutation Importance from the grid-searched KNeighborsRegressor() pipeline:
       feature  importance
13         lat    0.059882
11         NOV    0.032363
7          JUL    0.031565
0   Land Value    0.027916
5          MAY    0.024683
16     density    0.024294
2          FEB    0.016571
1          JAN    0.012489
15  population    0.009800
4          APR    0.005774
8          AUG    0.003775
3          MAR    0.001454
12         DEC   -0.006241
6          JUN   -0.009242
10         OCT   -0.010082
14         lng   -0.010163
9          SEP   -0.012938
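
Several importances above are negative, meaning that shuffling the feature happened to improve the score slightly. With a small test set this is usually noise rather than evidence that the feature hurts. One way to gauge that noise is to report `importances_std` next to `importances_mean`. The sketch below does so on synthetic data; the feature names `signal`, `noise_a`, and `noise_b` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only "signal" drives the target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.normal(size=(200, 3)), columns=["signal", "noise_a", "noise_b"]
)
y_demo = 3.0 * X_demo["signal"] + rng.normal(scale=0.1, size=200)

# Fit on the first 150 rows, evaluate on the remaining 50.
model = Pipeline([
    ("scaler",    StandardScaler()),
    ("regressor", KNeighborsRegressor(n_neighbors=5)),
]).fit(X_demo.iloc[:150], y_demo.iloc[:150])

result = permutation_importance(
    model, X_demo.iloc[150:], y_demo.iloc[150:], n_repeats=10, random_state=42
)

# Mean and spread of the score drop across the 10 shuffles.
summary = pd.DataFrame({
    "feature": X_demo.columns,
    "mean":    result.importances_mean,
    "std":     result.importances_std,
}).sort_values(by="mean", ascending=False)

print(summary)
```

A feature whose mean importance is within about one or two standard deviations of zero is hard to distinguish from an uninformative one.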


In [56]:
print("Feature Importance from Decision Tree:")
reg_tree.fit(X_train, y_train)

importances_tree = reg_tree.feature_importances_

feature_importance_tree_df = pd.DataFrame({
    'feature': X_train.columns,
    'importance': importances_tree
    })

feature_importance_tree_df = feature_importance_tree_df\
    .sort_values(by='importance', ascending=False)
print(feature_importance_tree_df)

Feature Importance from Decision Tree:
       feature  importance
4          APR    0.434568
13         lat    0.200770
16     density    0.111026
14         lng    0.067798
3          MAR    0.050087
1          JAN    0.049426
0   Land Value    0.022672
15  population    0.021762
7          JUL    0.016298
6          JUN    0.009972
2          FEB    0.003748
10         OCT    0.003428
8          AUG    0.003300
12         DEC    0.003186
9          SEP    0.001090
5          MAY    0.000626
11         NOV    0.000243


In [57]:
print("Coefficients from Ridge Regression:")

reg_ridge.fit(X_train, y_train)
coefficients_ridge = reg_ridge.coef_
coefficients_ridge_df = pd.DataFrame({
    'feature': X_train.columns,
    'coefficient': coefficients_ridge
    })
coefficients_ridge_df = coefficients_ridge_df\
    .sort_values(by='coefficient', ascending=False)

print(coefficients_ridge_df)

Coefficients from Ridge Regression:
       feature   coefficient
4          APR  1.570783e-01
9          SEP  1.525233e-01
3          MAR  8.172905e-02
1          JAN  8.143976e-02
8          AUG  3.154571e-02
15  population  6.287081e-08
0   Land Value -1.063750e-06
16     density -2.517912e-04
14         lng -6.228829e-04
7          JUL -1.406822e-03
2          FEB -8.395741e-03
10         OCT -2.592279e-02
6          JUN -5.341511e-02
11         NOV -7.374099e-02
12         DEC -9.358957e-02
5          MAY -1.298801e-01
13         lat -1.359069e-01
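
Note that `reg_ridge` is fitted here on unscaled features, so each coefficient's magnitude depends on its column's units. For example, `population` has a tiny coefficient largely because its values run into the hundreds of thousands. Coefficients become comparable when they are read from a pipeline that standardizes the features first. The sketch below illustrates the difference on synthetic data; the column names `small_scale` and `large_scale` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "small_scale": rng.normal(loc=50, scale=10, size=200),
    "large_scale": rng.normal(loc=500_000, scale=100_000, size=200),
})

# Both features contribute equally once measured in standard deviations.
y_demo = (
    (X_demo["small_scale"] - 50) / 10
    + (X_demo["large_scale"] - 500_000) / 100_000
    + rng.normal(scale=0.1, size=200)
)

# Ridge on raw features versus Ridge behind a StandardScaler.
raw = Ridge().fit(X_demo, y_demo)
scaled = Pipeline([
    ("scaler",    StandardScaler()),
    ("regressor", Ridge()),
]).fit(X_demo, y_demo)

coef_table = pd.DataFrame({
    "feature":     X_demo.columns,
    "raw_coef":    raw.coef_,
    "scaled_coef": scaled.named_steps["regressor"].coef_,
})
print(coef_table)
```

In the notebook, `best_pipelines['Ridge()'].named_steps['regressor'].coef_` would give the scaled analogue of the table above.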


In [58]:
print("Permutation Importance from SVR():")

result_svr = permutation_importance(
    best_pipelines['SVR()'],
    X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1
    )

perm_importances_svr = pd.DataFrame({
    'feature':    X_test.columns,
    'importance': result_svr.importances_mean
    })

perm_importances_svr = perm_importances_svr\
    .sort_values(by='importance', ascending=False)

print(perm_importances_svr)

Permutation Importance from SVR():
       feature  importance
13         lat    0.087710
6          JUN    0.018060
14         lng    0.015295
15  population    0.014443
2          FEB    0.012959
16     density    0.007483
0   Land Value    0.004602
12         DEC    0.003899
1          JAN    0.000625
11         NOV   -0.004155
5          MAY   -0.008077
3          MAR   -0.012850
4          APR   -0.015995
7          JUL   -0.024847
10         OCT   -0.027957
9          SEP   -0.030518
8          AUG   -0.033364


## Predicting Current for New Cities

In [59]:
X = df_solar.query('Current_log.notnull()').drop(['Current_log'], axis=1)
y = df_solar.query('Current_log.notnull()')['Current_log']

scaled_pipe = Pipeline([
    ('scaler',    StandardScaler()),
    ('regressor', KNeighborsRegressor(n_neighbors=5))
]).fit(X,y)

y_pred = scaled_pipe.predict(X_predict)

In [60]:

df_pred = X_predict.reset_index()[[
    'City-State','Land Value','population','density','lat','JUN','NOV'
    ]].set_index('City-State')

df_pred['Land Value'] = df_pred['Land Value'].astype(int)

df_pred['Current_log'] = y_pred
df_pred['Current']     = np.exp(y_pred)
df_pred

                 Land Value  population  density      lat  JUN  NOV  Current_log    Current
City-State
FORT WAYNE,IN        227793      345279    919.8  41.0888   74   40     2.555058  12.872041
ALPENA,MI            191528       10178    480.9   45.074   62   28     2.735924  15.423987
ANCHORAGE,AK         392428      289069     65.4  61.1508   46   35     4.155131  63.760312
BALTIMORE,MD         206499     2189589   2753.1  39.3051   60   50     4.429131  83.858493
POCATELLO,ID         317148       76370    649.5  42.8724   74   43      3.69142  40.101734
HURON,SD             195944       14347    569.2  44.3623   69   51     2.904322  18.252867
GREEN BAY,WI         272902      227679    904.9  44.5148   66   35     3.292618  26.913227
BISMARCK,ND          326276       99060    825.4  46.8143   67   42     2.815611  16.703377
SHREVEPORT,LA        146103      278269    655.0  32.4653   68   57     3.021222  20.516344
LAKE CHARLES,LA      211510      145110    664.4   30.201   80   55     3.967367  52.845232


In [61]:
df_pred[['Current_log','Current']].describe()

       Current_log     Current
count         49.0        49.0
mean      3.403162   44.884203
std       0.819981   57.107587
min       2.156944    8.644683
25%       2.767559   15.919727
50%       3.301391    27.15038
75%       3.754127   42.696949
max       5.891352  361.894237
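
The mean of `Current` (about 44.9) is well above its median (about 27.2), which is typical of a right-skewed variable. Exponentiating the mean of `Current_log` gives the geometric mean of `Current`, which is never larger than the arithmetic mean and is often a more representative "typical" value for skewed data. The sketch below does this arithmetic with the summary values copied from the table above.

```python
import numpy as np

# Summary values copied from the describe() output above.
mean_log       = 3.403162    # mean of Current_log
mean_current   = 44.884203   # arithmetic mean of Current
median_current = 27.15038    # median of Current

# exp(mean of logs) is the geometric mean of Current.
geometric_mean = np.exp(mean_log)

print(round(geometric_mean, 2))
```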


In [62]:
print(df_pred.sort_values(by='Current', ascending=False).head())

              Land Value  population  density      lat  JUN  NOV  Current_log  \
City-State                                                                      
KEY WEST,FL      1285405       25824   1780.7  24.5642   77   71     5.891352   
JACKSON,MS         85274      331332    518.0  32.3157   70   55     5.082682   
FLAGSTAFF,AZ      650010       77868    446.4  35.1872   88   72     4.864624   
AMARILLO,TX       211068      205100    748.8  35.1984   79   67     4.683935   
ELY,NV            255374        3941    199.5  39.2650   81   65     4.683935   

                 Current  
City-State                
KEY WEST,FL   361.894237  
JACKSON,MS    161.205909  
FLAGSTAFF,AZ  129.622143  
AMARILLO,TX   108.194995  
ELY,NV        108.194995  


In [63]:
print(df_pred.sort_values(by='Current', ascending=True).head())

               Land Value  population  density      lat  JUN  NOV  \
City-State                                                          
OMAHA,NE           278435      826161   1318.9  41.2627   72   49   
DODGE CITY,KS      237736       27652    711.1  37.7611   77   63   
CONCORDIA,KS       143484        5067    434.4  39.5669   78   59   
TULSA,OK           217766      740620    805.1  36.1283   66   50   
DES MOINES,IA      227345      560170    930.5  41.5725   69   45   

               Current_log    Current  
City-State                             
OMAHA,NE          2.156944   8.644683  
DODGE CITY,KS     2.316172  10.136799  
CONCORDIA,KS      2.316172  10.136799  
TULSA,OK          2.375974  10.761488  
DES MOINES,IA     2.446981  11.553411  
