# Capstone Assignment 20.1: Initial Report and Exploratory Data Analysis (EDA)

Nathan Oyama

## 1 &emsp; Planning the project

This project uses four data sets:

* Kaggle: "Percent Sunshine by US City". 18 Jan 2023. kaggle.com/datasets/thedevastator/annual-percent-of-possible-sunshine-by-us-city.

* US Geological Survey: "The United States Large-Scale Solar Photovoltaic Database (USPVDB)". 28 Apr 2025. US Department of the Interior. energy.usgs.gov/uspvdb/data.

* landvalue: "ZHVI 3-Bedroom Time Series($) - City". 25 Jun 2025. landvalue.com/research/data.

* Pareto Software: "United States Cities Database - Basic". 9 Jun 2025. simplemaps.com/data/us-cities.

Then take the following steps _for every data set_:

1. From the data set, which is in CSV format, create a pandas DataFrame object.
1. Analyze the DataFrame and identify which columns to use for this project.
1. Format the DataFrame before merging it with the others.

Finally, merge the DataFrames into one.
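
The steps above can be sketched on a tiny scale as follows. This is only an illustration: the two in-memory CSV files, the `Price` and `Unused` columns, and the price values are made up and stand in for the real files.

```python
import io
import pandas as pd

# Hypothetical miniature CSV files; the prices and the extra column are made up.
csv_sunshine = io.StringIO('CITY,ANN\n"BIRMINGHAM,AL",58\n"MONTGOMERY,AL",58\n')
csv_homes = io.StringIO(
    'RegionName,State,Price,Unused\n'
    'Birmingham,AL,250000,x\n'
    'Montgomery,AL,200000,y\n'
)

# Step 1: create one DataFrame per CSV source.
df_a = pd.read_csv(csv_sunshine)
df_b = pd.read_csv(csv_homes)

# Step 2: keep only the columns the project needs.
df_a = df_a[['CITY', 'ANN']].copy()
df_b = df_b[['RegionName', 'State', 'Price']].copy()

# Step 3: format both frames so that they share a key column.
df_a = df_a.rename(columns={'CITY': 'City-State'})
df_b['City-State'] = df_b['RegionName'].str.upper() + ',' + df_b['State']
df_b = df_b[['City-State', 'Price']]

# Final step: merge on the shared key.
df_merged = pd.merge(df_a, df_b, on='City-State', how='inner')
print(df_merged)
```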

## 2 &emsp; Analyzing Data Sets

Analyze those four data sets.

In [1]:
import pandas as pd
import re

pd.options.mode.copy_on_write = True

from sklearn.model_selection import train_test_split, GridSearchCV

import warnings

warnings.filterwarnings("ignore", message=".*pkg_resources is deprecated as an API.*")
warnings.filterwarnings("ignore", category=UserWarning)

### 2.1 &emsp; Analyzing Data Set 1: "Percent Sunshine by US City"

In [2]:
df_sunshine_original = pd.read_csv(
    './data/Average Percent of Possible Sunshine by US City.csv'
    )

print(df_sunshine_original.head())

   index           CITY JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC ANN  \
0      0  BIRMINGHAM,AL  46  53  57  65  65  67  59  62  59  66  55  49  58   
1      1  MONTGOMERY,AL  47  55  58  64  63  64  61  61  59  63  55  49  58   
2      2   ANCHORAGE,AK  43  46  51  50  51  46  43  43  41  36  35  33  43   
3      3      JUNEAU,AK  39  35  38  42  44  37  33  35  27  21  26  21  33   
4      4        NOME,AK  38  56  54  52  52  43  39  34  38  35  30  36  42   

   Unnamed: 14  
0          NaN  
1          NaN  
2          NaN  
3          NaN  
4          NaN  


The output above shows the first few records of the original DataFrame for the sunshine data. You can discard two unimportant columns: "index" and "Unnamed: 14".

The remaining columns are the "CITY" column, one column for each of the 12 months ("JAN" through "DEC"), and the annual column "ANN". The "CITY" column holds the name of the city in uppercase, followed by a comma (",") and the state abbreviation. You can use this column as the index of this DataFrame. As the title of the data set indicates, each monthly column holds the average percent of possible sunshine for that month, and the "ANN" field holds the annual figure. For example, Birmingham, Alabama received approximately 46% of its possible sunshine in January and approximately 58% over the whole year.
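
Because the "CITY" values combine a city name and a state abbreviation, it can help to see how they split apart. The following is a minimal sketch, not part of the original analysis; the column names `city_name` and `state` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'CITY': ['BIRMINGHAM,AL', 'YAP- W CAROLINE IS.,PC']})

# Split on the last comma only, in case a city name itself contains a comma.
parts = df['CITY'].str.rsplit(',', n=1, expand=True)
df['city_name'] = parts[0]
df['state'] = parts[1]
print(df)
```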

In [3]:
df_sunshine_original.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 317 entries, 0 to 316
Data columns (total 16 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   index        317 non-null    int64  
 1   CITY         317 non-null    object 
 2   JAN          307 non-null    object 
 3   FEB          309 non-null    object 
 4   MAR          309 non-null    object 
 5   APR          309 non-null    object 
 6   MAY          311 non-null    object 
 7   JUN          311 non-null    object 
 8   JUL          317 non-null    object 
 9   AUG          317 non-null    object 
 10  SEP          317 non-null    object 
 11  OCT          317 non-null    object 
 12  NOV          317 non-null    object 
 13  DEC          313 non-null    object 
 14  ANN          307 non-null    object 
 15  Unnamed: 14  0 non-null      float64
dtypes: float64(1), int64(1), object(14)
memory usage: 39.8+ KB


### 2.2 &emsp; Analyzing Data Set 2: "The US Large-Scale Solar Photovoltaic Database (USPVDB)"

In [4]:
df_photovoltaic_original = pd.read_csv(
    './data/uspvdb_v3_0_20250430.csv'
    )

print(df_photovoltaic_original.shape)

(5712, 26)


In [5]:
print(df_photovoltaic_original.iloc[:,:13].head())

   case_id multi_poly  eia_id p_state           p_county       ylat  \
0   406374     single   66887      AK  Matanuska-Susitna  61.587349   
1   405016      multi    6304      AK   Northwest Arctic  66.838470   
2   401476      multi   60058      AL         Lauderdale  34.833809   
3   401865      multi   60679      AL               Dale  31.331732   
4   401866      multi   60680      AL            Calhoun  33.626301   

        xlong   p_area  p_img_date  p_dig_conf                   p_name  \
0 -149.789413   172005    20240814           4            Houston Solar   
1 -162.553146     8740    20240719           4          Kotzebue Hybrid   
2  -87.838394  1735134    20220212           4    River Bend Solar, LLC   
3  -85.729469   187820    20220609           4  Fort Rucker Solar Array   
4  -85.940590    39717    20210814           4         ANAD Solar Array   

   p_year p_pwr_reg  
0    2023        AK  
1    2020       NaN  
2    2016       TVA  
3    2017      SOCO  
4    2017   

In [6]:
print(df_photovoltaic_original.iloc[:,13:26].head())

  p_tech_pri p_tech_sec p_sys_type       p_axis  p_azimuth  p_tilt  p_battery  \
0         PV        NaN     ground   fixed-tilt      180.0    40.0        NaN   
1         PV        NaN     ground  single-axis      156.0    40.0  batteries   
2         PV       c-si     ground  single-axis      270.0    17.0        NaN   
3         PV  thin-film     ground  single-axis      188.0    20.0        NaN   
4         PV  thin-film     ground   fixed-tilt      180.0    20.0        NaN   

   p_cap_ac  p_cap_dc      p_type       p_agrivolt p_comm  p_zscore  
0       6.0       8.4  greenfield             crop    NaN -0.457675  
1       1.7       3.4  greenfield  non-agrivoltaic    NaN  5.617232  
2      75.0     100.2  greenfield  non-agrivoltaic    NaN -0.298527  
3      10.6      12.7  greenfield  non-agrivoltaic    NaN -0.122265  
4       7.4       9.7   superfund  non-agrivoltaic    NaN  3.031619  


In [7]:
df_photovoltaic_original.query('p_cap_ac.isnull() | p_cap_dc.isnull()').shape

(0, 26)

### 2.3 &emsp; Analyzing Data Set 3: "ZHVI 3-Bedroom Time Series($) - City"

In [8]:
df_landvalue_original = pd.read_csv(
    './data/City_zhvi_bdrmcnt_3_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv.zip',
    compression='zip'
    )

df_landvalue_original.columns

Index(['RegionID', 'SizeRank', 'RegionName', 'RegionType', 'StateName',
       'State', 'Metro', 'CountyName', '2000-01-31', '2000-02-29',
       ...
       '2024-08-31', '2024-09-30', '2024-10-31', '2024-11-30', '2024-12-31',
       '2025-01-31', '2025-02-28', '2025-03-31', '2025-04-30', '2025-05-31'],
      dtype='object', length=313)

In [9]:
df_landvalue_original.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15711 entries, 0 to 15710
Columns: 313 entries, RegionID to 2025-05-31
dtypes: float64(305), int64(2), object(6)
memory usage: 37.5+ MB


In [10]:
print(
    df_landvalue_original[[
        'RegionID', 'SizeRank', 'RegionName', 'RegionType', 'StateName',
        'State', 'Metro', 'CountyName', '2025-05-31'
        ]].head()
    )

   RegionID  SizeRank   RegionName RegionType StateName State  \
0      6181         0     New York       city        NY    NY   
1     12447         1  Los Angeles       city        CA    CA   
2     39051         2      Houston       city        TX    TX   
3     17426         3      Chicago       city        IL    IL   
4      6915         4  San Antonio       city        TX    TX   

                                   Metro          CountyName     2025-05-31  
0  New York-Newark-Jersey City, NY-NJ-PA       Queens County  840048.900964  
1     Los Angeles-Long Beach-Anaheim, CA  Los Angeles County  964249.977821  
2   Houston-The Woodlands-Sugar Land, TX       Harris County  253134.060059  
3     Chicago-Naperville-Elgin, IL-IN-WI         Cook County  336756.496352  
4          San Antonio-New Braunfels, TX        Bexar County  235986.092899  


### 2.4 &emsp; City information

This data set lists most cities in the United States with several data fields for every city: name, longitude, latitude, population density, and so on.

In [11]:
df_city_original = pd.read_csv(
    './data/uscities.csv'
    )

print(df_city_original.head())
print("")
print("df_city_original.shape ...", df_city_original.shape)

          city   city_ascii state_id  state_name  county_fips  county_name  \
0     New York     New York       NY    New York        36081       Queens   
1  Los Angeles  Los Angeles       CA  California         6037  Los Angeles   
2      Chicago      Chicago       IL    Illinois        17031         Cook   
3        Miami        Miami       FL     Florida        12086   Miami-Dade   
4      Houston      Houston       TX       Texas        48201       Harris   

       lat       lng  population  density source  military  incorporated  \
0  40.6943  -73.9249    18832416  10943.7  shape     False          True   
1  34.1141 -118.4068    11885717   3165.7  shape     False          True   
2  41.8375  -87.6866     8489066   4590.3  shape     False          True   
3  25.7840  -80.2101     6113982   4791.1  shape     False          True   
4  29.7860  -95.3885     6046392   1386.2  shape     False          True   

              timezone  ranking  \
0     America/New_York        1   
1  A

This data set has both a `city` column and a `city_ascii` column. Check where the two differ:

In [12]:
df_city_diff = \
    df_city_original[['city','city_ascii','state_id']]\
    .query('city != city_ascii')

print(
    "The first few rows out of", df_city_diff.shape[0], "rows "
    "whose values in the city field and city_ascii field are different: "
    "\n"
    )

print(df_city_diff.head())

The first few rows out of 76 rows whose values in the city field and city_ascii field are different: 

            city  city_ascii state_id
265      Bayamón     Bayamon       PR
484   San Germán  San German       PR
525     Mayagüez    Mayaguez       PR
752   Juana Díaz  Juana Diaz       PR
2014      Cataño      Catano       PR


Many of these city names are in Puerto Rico and contain Spanish letters. Check all Puerto Rico cities in the sunshine data set:

In [13]:
print(
    df_sunshine_original.iloc[1:158].query('CITY.str.contains(",PR")')['CITY']
    )

157    SAN JUAN,PR
Name: CITY, dtype: object


The sunshine data set contains only one city in Puerto Rico: San Juan. Check how "San Juan" is defined in the city data set:

In [14]:
print(
    df_city_original[['city','city_ascii','state_id']]\
        .query('city_ascii == "San Juan"')
    )

          city city_ascii state_id
29    San Juan   San Juan       PR
1298  San Juan   San Juan       TX


There are actually two cities named San Juan: one in Puerto Rico and the other in Texas. Neither name contains a non-ASCII letter, so you can keep the first record, for San Juan, Puerto Rico.

Check all cities outside Puerto Rico whose names contain non-ASCII letters:

In [15]:
print(df_city_diff.query('state_id != "PR"'))

                       city            city_ascii state_id
2250   La Cañada Flintridge  La Canada Flintridge       CA
2627             Cañon City            Canon City       CO
3944               Española              Espanola       NM
5238            Piñon Hills           Pinon Hills       CA
11895          César Chávez          Cesar Chavez       TX
14389              Doña Ana              Dona Ana       NM
15643             Cañoncito             Canoncito       NM
17598           Peña Blanca           Pena Blanca       NM
18469               Peñasco               Penasco       NM
19962  Cañada de los Alamos  Canada de los Alamos       NM
20794                 Cañon                 Canon       NM
27154              Salineño              Salineno       TX
28089               Cañones               Canones       NM
28160        Salineño North        Salineno North       TX
29328                Lopeño                Lopeno       TX


These cities are located in California, Colorado, New Mexico, or Texas. Check cities in those states in the sunshine data set:

In [16]:
print(df_sunshine_original.iloc[1:158].query('False \
    |   CITY.str.contains(",CA") \
    |   CITY.str.contains(",CO") \
    |   CITY.str.contains(",NM") \
    |   CITY.str.contains(",TX") \
    ')['CITY'])

10             FRESNO,CA
11        LOS ANGELES,CA
12         SACRAMENTO,CA
13          SAN DIEGO,CA
14      SAN FRANCISCO,CA
15             DENVER,CO
16     GRAND JUNCTION,CO
17             PUEBLO,CO
84        ALBUQUERQUE,NM
85            ROSWELL,NM
123           ABILENE,TX
124          AMARILLO,TX
125            AUSTIN,TX
126       BROWNSVILLE,TX
127    CORPUS CHRISTI,TX
128            DALLAS,TX
129           EL PASO,TX
130           HOUSTON,TX
131           LUBBOCK,TX
132    MIDLAND-ODESSA,TX
133       PORT ARTHUR,TX
134       SAN ANTONIO,TX
Name: CITY, dtype: object


None of these cities in the sunshine data set appears among the records of the city data set whose names contain non-ASCII letters. Therefore, you can remove all records containing non-ASCII letters from the city data set.
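
Removing those records could look like the following. This is a minimal sketch on a three-row sample, not the code used later in this report; `df_ascii_only` is a name chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Bayamón', 'San Juan', 'Cañon City'],
    'state_id': ['PR', 'PR', 'CO'],
})

# str.isascii() is True only when every character is in the ASCII range.
mask_ascii = df['city'].map(str.isascii)
df_ascii_only = df[mask_ascii]
print(df_ascii_only)
```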

### 2.5 &emsp; Planning the merged DataFrame


A combined DataFrame: `df_solar`

| Column               | Example          | Data Sets                          |
| :------------------- | :--------------- | :--------------------------------- |
| County-State         | ALAMEDA,CA       | City, Photovoltaic, Land Value     |
| City-State           | BERKELEY,CA      | City, Sunshine, Land Value         |
| Longitude            | -149.789413      | City, Photovoltaic                 |
| Latitude             | 61.587349        | City, Photovoltaic                 |
| ANN                  | 58               | Sunshine                           |
| JAN ... DEC          | 58               | Sunshine                           |
| DC                   | 8.4              | Photovoltaic                       |
| AC                   | 6.0              | Photovoltaic                       |
| Current              | 14.4             | (DC + AC)                          |
| Land Value           | 840048.900963529 | Land Value                         |

Some cities that are listed in the Land Value data set and the Sunshine data set do not have solar power plants. In later steps, you predict `Current` for the cities that do not have solar power plants and determine in which cities solar power plants should be built.


Note that in the United States, a city name alone does not identify a city. Some cities share the same name, and even the same county name, while being located in different states.

| Column | Sunshine | Land Value  | PV      | City    | Example 1  | Example 2  |
| :----- | :------: | :---------: | :-----: | :-----: | :--------- | :--------- |
| City   | &#9679;  | &#9679;     | -       | &#9679; | Franklin   | Franklin   |
| County | -        | &#9679;     | &#9679; | &#9679; | Williamson | Williamson |
| State  | &#9679;  | &#9679;     | &#9679; | &#9679; | Tennessee  | Texas      |
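
The two Franklin examples in the table can be checked with a short sketch. It only illustrates why the keys in this report combine a name with the state abbreviation.

```python
import pandas as pd

df = pd.DataFrame({
    'city':   ['Franklin', 'Franklin'],
    'county': ['Williamson', 'Williamson'],
    'state':  ['TN', 'TX'],
})

# City and county alone cannot tell the two rows apart ...
has_duplicates = df.duplicated(subset=['city', 'county']).any()

# ... but a key that includes the state can.
df['City-State'] = df['city'].str.upper() + ',' + df['state']
print(has_duplicates, df['City-State'].tolist())
```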


### 2.6 &emsp; Converting DataFrames

The original data sets are stored in CSV format. When these CSV files are loaded into pandas DataFrame objects, all numeric entries should be recognized as either the integer data type or the float data type.

In [17]:
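def convert_df_obj_numeric(df):
    """Convert every object-typed column of df to a numeric dtype, in place.

    Entries that cannot be parsed as numbers become NaN (errors='coerce').
    """
    cols_obj = df.select_dtypes(include='object').columns
    df[cols_obj] = df[cols_obj].apply(pd.to_numeric, errors='coerce')
    return df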
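
Note that a helper of this kind converts every object column, so a text column such as `CITY` would end up filled with NaN. A possible variant, sketched here under the assumption that the caller knows which columns are numeric, restricts the conversion to a given list. The name `convert_columns_numeric` is hypothetical and is not used elsewhere in this report.

```python
import pandas as pd

def convert_columns_numeric(df, cols):
    """Return a copy of df with only the listed columns converted to numbers."""
    out = df.copy()
    out[cols] = out[cols].apply(pd.to_numeric, errors='coerce')
    return out

df = pd.DataFrame({
    'CITY': ['BIRMINGHAM,AL', 'CITY'],
    'JAN':  ['46', 'JAN'],   # the second row mimics a stray header row
})
converted = convert_columns_numeric(df, ['JAN'])
print(converted)
print(converted.dtypes)
```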

## 3 &emsp; Extracting Data Sets

### 3.1 &emsp; Dataset 1: Sunshine

In [18]:
df_sunshine_original = pd.read_csv(
    './data/Average Percent of Possible Sunshine by US City.csv'
    )

Check the `CITY` column:

In [19]:
print(df_sunshine_original[['CITY']].value_counts())

CITY                  
ABERDEEN,SD               2
PENSACOLA,FL              2
NOME,AK                   2
NORFOLK,VA                2
NORTH PLATTE,NE           2
                         ..
GRAND RAPIDS,MI           2
GREAT FALLS,MT            2
GREEN BAY,WI              2
YAP- W CAROLINE IS.,PC    2
CITY                      1
Name: count, Length: 159, dtype: int64


There is one invalid entry, `"CITY"`, and every other city has exactly two entries. Check the row where the `CITY` column is `CITY`:

In [20]:
print(df_sunshine_original[df_sunshine_original['CITY'] == 'CITY'])

     index  CITY  JAN  FEB  MAR  APR  MAY  JUN  JUL  AUG  SEP  OCT  NOV  DEC  \
158    158  CITY  JAN  FEB  MAR  APR  MAY  JUN  JUL  AUG  SEP  OCT  NOV  DEC   

     ANN  Unnamed: 14  
158  ANN          NaN  


In [21]:
print(df_sunshine_original.sort_values(by=['CITY','index']).head(10))

     index            CITY  JAN  FEB  MAR  APR  MAY  JUN JUL AUG SEP OCT NOV  \
115    115     ABERDEEN,SD  NaN   54   58   63   65   66  74  78  68  48  21   
315    315     ABERDEEN,SD  NaN   54   58   63   65   66  74  78  68  48  21   
123    123      ABILENE,TX   63   66   70   71   71   77  80  75  69  68  64   
182    182      ABILENE,TX   63   66   70   71   71   77  80  75  69  68  64   
86      86       ALBANY,NY   46   52   51   55   53   55  62  58  54  46  33   
287    287       ALBANY,NY   46   52   51   55   53   55  62  58  54  46  33   
84      84  ALBUQUERQUE,NM   73   73   73   78   80   82  76  76  77  80  75   
169    169  ALBUQUERQUE,NM   73   73   73   78   80   82  76  76  77  80  75   
108    108    ALLENTOWN,PA  NaN  NaN  NaN  NaN  NaN  NaN  90  93  82  52  47   
314    314    ALLENTOWN,PA  NaN  NaN  NaN  NaN  NaN  NaN  90  93  82  52  47   

     DEC  ANN  Unnamed: 14  
115  NaN  NaN          NaN  
315  NaN  NaN          NaN  
123   65   69          NaN  
182

In this data set, the row where `'index'` is 158 repeats the column names instead of holding appropriate values, so you can remove it.

Look at the `"index"` field of the pair of rows for every city: one row has a value less than 158 and the other row has a value greater than 158. All other values, such as `"ANN"`, are the same.

Assume that this data set contains the same table twice, with the rows in different orders. The first table occupies the rows before index 158, and the second table occupies the rows after it. You only need the first one.

Construct a new DataFrame that is based on the original data set for the sunshine information with the following changes:

* Keep only rows from the first table.
* Rename the `CITY` column to `City-State`.
* Use the `City-State` column as the index instead of the `index` column.
* Drop the unnecessary columns: `index` and `"Unnamed: 14"`.
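
The assumption that the file holds the same table twice can be verified programmatically. The following is a minimal sketch on a made-up five-row miniature of the raw file; it assumes a default `RangeIndex`, so that index labels and row positions coincide.

```python
import pandas as pd

# Miniature of the raw file: a table, a repeated header row,
# and the same table again in another order.
df_raw = pd.DataFrame({
    'CITY': ['BIRMINGHAM,AL', 'MONTGOMERY,AL', 'CITY',
             'MONTGOMERY,AL', 'BIRMINGHAM,AL'],
    'ANN':  ['58', '58', 'ANN', '58', '58'],
})

# Position of the row that repeats the column names.
header_pos = df_raw.index[df_raw['CITY'] == 'CITY'][0]

first = df_raw.iloc[:header_pos]
second = df_raw.iloc[header_pos + 1:]

# The two halves should hold the same records once sorted by city.
same = first.sort_values('CITY').reset_index(drop=True).equals(
    second.sort_values('CITY').reset_index(drop=True)
)
print(header_pos, same)
```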

In [22]:
df_sunshine = df_sunshine_original.iloc[1:158]

df_sunshine.rename(columns={'CITY': 'City-State'}, inplace=True)

df_sunshine.set_index(['City-State'], inplace=True)

del df_sunshine['index']
del df_sunshine['Unnamed: 14']

Check rows that include a null value in any field:

In [23]:
print(df_sunshine[df_sunshine.isna().any(axis=1)])

               JAN  FEB  MAR  APR  MAY  JUN JUL AUG SEP OCT NOV  DEC  ANN
City-State                                                               
TUPELO,MS      NaN  NaN  NaN  NaN  NaN  NaN  66  59  61  67  57   45  NaN
CINCINNATI,OH  NaN  NaN  NaN  NaN   47   70  85  76  77  50  44   30  NaN
ALLENTOWN,PA   NaN  NaN  NaN  NaN  NaN  NaN  90  93  82  52  47  NaN  NaN
ABERDEEN,SD    NaN   54   58   63   65   66  74  78  68  48  21  NaN  NaN
ELKINS,WV      NaN  NaN  NaN  NaN  NaN  NaN  69  50  52  43  31   18  NaN


There are 5 such rows, which constitute ~3.2% of the whole DataFrame. Because it seems almost impossible to find appropriate values to fill in the null fields, you should discard these 5 records.
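
The share of incomplete rows can be computed rather than counted by hand. A minimal sketch on a four-row sample (so the percentage differs from the one quoted above):

```python
import pandas as pd

df = pd.DataFrame(
    {'JAN': [46, None, 43, 39], 'ANN': [58, None, 43, 33]},
    index=['BIRMINGHAM,AL', 'TUPELO,MS', 'ANCHORAGE,AK', 'JUNEAU,AK'],
)

rows_with_null = df.isna().any(axis=1)
share = rows_with_null.mean()   # fraction of rows with at least one null
print(f"{rows_with_null.sum()} of {len(df)} rows ({share:.1%})")

df_clean = df.dropna()
```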

In [24]:
df_sunshine.dropna(inplace=True)

Convert entries in the numeric columns to numeric data type:

In [25]:
cols_numeric = [
    'JAN','FEB','MAR','APR','MAY','JUN','JUL','AUG','SEP','OCT','NOV','DEC',
    'ANN'
    ]

df_sunshine[cols_numeric] = df_sunshine[cols_numeric]\
    .apply(pd.to_numeric, errors='coerce')

print(df_sunshine.head())

               JAN  FEB  MAR  APR  MAY  JUN  JUL  AUG  SEP  OCT  NOV  DEC  ANN
City-State                                                                    
MONTGOMERY,AL   47   55   58   64   63   64   61   61   59   63   55   49   58
ANCHORAGE,AK    43   46   51   50   51   46   43   43   41   36   35   33   43
JUNEAU,AK       39   35   38   42   44   37   33   35   27   21   26   21   33
NOME,AK         38   56   54   52   52   43   39   34   38   35   30   36   42
FLAGSTAFF,AZ    71   73   72   82   83   88   74   75   79   77   72   76   76


In [26]:
print(df_sunshine.describe())

              JAN         FEB         MAR         APR         MAY         JUN  \
count  152.000000  152.000000  152.000000  152.000000  152.000000  152.000000   
mean    51.039474   56.013158   59.164474   61.625000   62.921053   66.302632   
std     11.677159    9.966159   10.116830   10.474701   10.261822   11.132578   
min     20.000000   28.000000   31.000000   36.000000   37.000000   31.000000   
25%     43.000000   50.000000   52.000000   55.000000   58.000000   61.000000   
50%     51.000000   56.000000   59.000000   59.000000   61.000000   65.500000   
75%     58.000000   62.000000   65.000000   67.000000   66.250000   72.000000   
max     80.000000   83.000000   87.000000   92.000000   94.000000   95.000000   

              JUL         AUG         SEP         OCT         NOV         DEC  \
count  152.000000  152.000000  152.000000  152.000000  152.000000  152.000000   
mean    68.848684   67.315789   64.217105   60.184211   50.105263   47.309211   
std     11.375447   11.0306

All the maximum and minimum values look reasonable: every value is a percentage between 0 and 100. For example, one city received only 16% of its possible sunshine in December, whereas another city received 97% in July. Remember, the United States is located in the northern hemisphere, where daytime is longer in summer.
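
A range check of this kind can also be automated. The sketch below assumes, as the title of the data set indicates, that the values are percentages, and uses only the two rows discussed here.

```python
import pandas as pd

df = pd.DataFrame(
    {'JUL': [97, 42], 'DEC': [47, 16], 'ANN': [77, 32]},
    index=['SACRAMENTO,CA', 'QUILLAYUTE,WA'],
)

# A percentage outside 0..100 would indicate a data-entry or parsing problem.
in_range = df.ge(0) & df.le(100)
all_valid = bool(in_range.all().all())

# idxmax / idxmin return the index label of the extreme value in a column.
print(all_valid, df['JUL'].idxmax(), df['DEC'].idxmin())
```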

Take a look at the cities with the highest value in July and the lowest value in December:

In [27]:
print(df_sunshine.query('JUL == 97.0 | DEC == 16.0')[['JUL','DEC','ANN']])

               JUL  DEC  ANN
City-State                  
SACRAMENTO,CA   97   47   77
QUILLAYUTE,WA   42   16   32


In [28]:
del df_sunshine_original

### 3.2 &emsp; Data Set 2: Photovoltaic

In [29]:
df_photovoltaic = df_photovoltaic_original[[
    'case_id', 'p_county', 'p_state', 'xlong', 'ylat', 'p_cap_ac', 'p_cap_dc'
    ]].set_index('case_id')

df_photovoltaic.rename(
    columns={
        'xlong':    'Longitude',
        'ylat':     'Latitude',
        'p_cap_ac': 'AC',
        'p_cap_dc': 'DC'
        },
    inplace=True
    )

df_photovoltaic['Current'] = df_photovoltaic['AC'] + df_photovoltaic['DC'] 

df_photovoltaic['County-State'] = df_photovoltaic['p_county'].str.upper() \
    + ',' +  df_photovoltaic['p_state']

del df_photovoltaic['p_county']
del df_photovoltaic['p_state']
del df_photovoltaic_original

print(df_photovoltaic.head())

          Longitude   Latitude    AC     DC  Current          County-State
case_id                                                                   
406374  -149.789413  61.587349   6.0    8.4     14.4  MATANUSKA-SUSITNA,AK
405016  -162.553146  66.838470   1.7    3.4      5.1   NORTHWEST ARCTIC,AK
401476   -87.838394  34.833809  75.0  100.2    175.2         LAUDERDALE,AL
401865   -85.729469  31.331732  10.6   12.7     23.3               DALE,AL
401866   -85.940590  33.626301   7.4    9.7     17.1            CALHOUN,AL


### 3.3 &emsp; Dataset 3: Land Values

In the original data set, there are many columns of historical average home values for 3-bedroom houses, but you only need the latest values: `"2025-05-31"`.
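
If the file is refreshed later, the name of the latest column changes. One way to avoid hard-coding it, sketched here as an assumption and not as the approach taken in this report, is to pick the largest date-like column name. The `0.0` entry is a placeholder, not a value from the real data.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'RegionID': [6181],
    'RegionName': ['New York'],
    '2025-04-30': [0.0],            # placeholder value, not from the real data
    '2025-05-31': [840048.900964],
})

# ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
date_cols = [c for c in df.columns if re.fullmatch(r'\d{4}-\d{2}-\d{2}', c)]
latest = max(date_cols)
print(latest)
```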

See the first few rows of the original data set while excluding all the other columns for historical home values:

In [30]:
df_landvalue = df_landvalue_original.copy()[[
    'RegionID', 'SizeRank', 'State', 'RegionName', 'CountyName', '2025-05-31'
    ]].set_index('RegionID')

df_landvalue.rename(columns={'2025-05-31': 'Land Value'}, inplace=True)

df_landvalue['County-State']\
    = df_landvalue['CountyName']\
        .str.replace(r'\s* County$', '', regex=True)\
        .str.upper() + ',' + df_landvalue['State']
del df_landvalue['CountyName']

df_landvalue['City-State'] \
    = df_landvalue['RegionName'].str.upper() + ',' + df_landvalue['State']
del df_landvalue['RegionName']

del df_landvalue['State']

del df_landvalue_original
print(df_landvalue.head())

          SizeRank     Land Value    County-State      City-State
RegionID                                                         
6181             0  840048.900964       QUEENS,NY     NEW YORK,NY
12447            1  964249.977821  LOS ANGELES,CA  LOS ANGELES,CA
39051            2  253134.060059       HARRIS,TX      HOUSTON,TX
17426            3  336756.496352         COOK,IL      CHICAGO,IL
6915             4  235986.092899        BEXAR,TX  SAN ANTONIO,TX


In [31]:
df_landvalue.info()

<class 'pandas.core.frame.DataFrame'>
Index: 15711 entries, 6181 to 52600
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   SizeRank      15711 non-null  int64  
 1   Land Value    15711 non-null  float64
 2   County-State  15711 non-null  object 
 3   City-State    15711 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 613.7+ KB


### 3.4 &emsp; Loading and optimizing the city data set

In [32]:
df_city = df_city_original.copy()\
    [['city','state_id','county_name','lat','lng','population','density']]

print(df_city.head())

          city state_id  county_name      lat       lng  population  density
0     New York       NY       Queens  40.6943  -73.9249    18832416  10943.7
1  Los Angeles       CA  Los Angeles  34.1141 -118.4068    11885717   3165.7
2      Chicago       IL         Cook  41.8375  -87.6866     8489066   4590.3
3        Miami       FL   Miami-Dade  25.7840  -80.2101     6113982   4791.1
4      Houston       TX       Harris  29.7860  -95.3885     6046392   1386.2


In [33]:
df_city['City-State'] \
    = df_city['city'].str.upper()        + ',' + df_city['state_id']
df_city['County-State'] \
    = df_city['county_name'].str.upper() + ',' + df_city['state_id']

del df_city['city']
del df_city['county_name']
del df_city['state_id']

In [34]:
print(df_city['City-State'].value_counts())

City-State
OAKWOOD,OH           3
SAN ANTONIO,PR       3
OAKLAND,PA           3
GEORGETOWN,PA        3
MIDWAY,FL            3
                    ..
BOUTTE,LA            1
BEDFORD HILLS,NY     1
BOWLING GREEN,FL     1
PIRU,CA              1
FALCON VILLAGE,TX    1
Name: count, Length: 31183, dtype: int64


In [35]:
print(df_city.query('`City-State` == "OAKWOOD,OH"'))

           lat      lng  population  density  City-State   County-State
4258   39.7202 -84.1734        9480   1667.9  OAKWOOD,OH  MONTGOMERY,OH
8100   41.3669 -81.5036        3526    394.7  OAKWOOD,OH    CUYAHOGA,OH
20508  41.0927 -84.3747         443    243.5  OAKWOOD,OH    PAULDING,OH


In [36]:
df_city = df_city\
    .sort_values('population', ascending=False)\
    .drop_duplicates('City-State')

print(df_city.query('`City-State` == "OAKWOOD,OH"'))

          lat      lng  population  density  City-State   County-State
4258  39.7202 -84.1734        9480   1667.9  OAKWOOD,OH  MONTGOMERY,OH
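
The de-duplication above keeps, for each `City-State` key, the row with the largest population. A minimal sketch of the same idea on a four-row sample:

```python
import pandas as pd

df = pd.DataFrame({
    'City-State':   ['OAKWOOD,OH', 'OAKWOOD,OH', 'OAKWOOD,OH', 'MIAMI,FL'],
    'County-State': ['PAULDING,OH', 'MONTGOMERY,OH', 'CUYAHOGA,OH',
                     'MIAMI-DADE,FL'],
    'population':   [443, 9480, 3526, 6113982],
})

# After sorting by population (largest first), drop_duplicates keeps the
# first row of each key, i.e. the most populous place with that name.
df_unique = (
    df.sort_values('population', ascending=False)
      .drop_duplicates('City-State')
)
print(df_unique)
```

Whether the most populous place is the right one to keep depends on which place the other data sets refer to, so this choice is a heuristic.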


In [37]:
print(df_city['City-State'].value_counts())

City-State
NEW YORK,NY          1
SKELLYTOWN,TX        1
PETER,UT             1
MAMMOTH,PA           1
ELMORA,PA            1
                    ..
ESTILL,SC            1
SAND HILL,PA         1
MAUNAWILI,HI         1
HANAPEPE,HI          1
FALCON VILLAGE,TX    1
Name: count, Length: 31183, dtype: int64


Now that every `City-State` value is unique, you can use the `City-State` column as the index of the city DataFrame.

In [38]:
df_city.set_index(['City-State'], inplace=True)

In [39]:
df_city.head()

| City-State     | lat     | lng       | population | density | County-State   |
| :------------- | ------: | --------: | ---------: | ------: | :------------- |
| NEW YORK,NY    | 40.6943 | -73.9249  | 18832416   | 10943.7 | QUEENS,NY      |
| LOS ANGELES,CA | 34.1141 | -118.4068 | 11885717   | 3165.7  | LOS ANGELES,CA |
| CHICAGO,IL     | 41.8375 | -87.6866  | 8489066    | 4590.3  | COOK,IL        |
| MIAMI,FL       | 25.784  | -80.2101  | 6113982    | 4791.1  | MIAMI-DADE,FL  |
| HOUSTON,TX     | 29.786  | -95.3885  | 6046392    | 1386.2  | HARRIS,TX      |


## 4 &emsp; Combining the DataFrames into One

In [40]:
df_solar = df_landvalue.copy()

df_solar = pd.merge(df_solar, df_sunshine,     on='City-State',   how='inner')
df_solar = pd.merge(df_solar, df_photovoltaic, on='County-State', how='outer')

print(df_solar.query('ANN.notnull()').head())

print(df_solar.shape)

    SizeRank     Land Value County-State City-State   JAN   FEB   MAR   APR  \
2       97.0  474416.339219       ADA,ID   BOISE,ID  32.0  49.0  66.0  68.0   
3       97.0  474416.339219       ADA,ID   BOISE,ID  32.0  49.0  66.0  68.0   
63     246.0  314400.663535    ALBANY,NY  ALBANY,NY  46.0  52.0  51.0  55.0   
64     246.0  314400.663535    ALBANY,NY  ALBANY,NY  46.0  52.0  51.0  55.0   
65     246.0  314400.663535    ALBANY,NY  ALBANY,NY  46.0  52.0  51.0  55.0   

     MAY   JUN  ...   SEP   OCT   NOV   DEC   ANN   Longitude   Latitude  \
2   74.0  76.0  ...  80.0  69.0  41.0  34.0  63.0 -116.327415  43.438301   
3   74.0  76.0  ...  80.0  69.0  41.0  34.0  63.0 -116.289497  43.468910   
63  53.0  55.0  ...  54.0  46.0  33.0  36.0  50.0  -73.865364  42.585896   
64  53.0  55.0  ...  54.0  46.0  33.0  36.0  50.0  -73.826706  42.540352   
65  53.0  55.0  ...  54.0  46.0  33.0  36.0  50.0  -73.830566  42.542080   

      AC    DC  Current  
2   40.0  54.6     94.6  
3   20.0  26.0  

In [41]:
df_solar_ml = df_solar[[
    'JAN','FEB','MAR','APR','MAY','JUN','JUL','AUG','SEP','OCT','NOV','DEC',
    'ANN', 'Land Value', 'Longitude', 'Latitude', 'Current'
    ]].query('ANN.notnull()')

print(df_solar_ml.head())

     JAN   FEB   MAR   APR   MAY   JUN   JUL   AUG   SEP   OCT   NOV   DEC  \
2   32.0  49.0  66.0  68.0  74.0  76.0  85.0  82.0  80.0  69.0  41.0  34.0   
3   32.0  49.0  66.0  68.0  74.0  76.0  85.0  82.0  80.0  69.0  41.0  34.0   
63  46.0  52.0  51.0  55.0  53.0  55.0  62.0  58.0  54.0  46.0  33.0  36.0   
64  46.0  52.0  51.0  55.0  53.0  55.0  62.0  58.0  54.0  46.0  33.0  36.0   
65  46.0  52.0  51.0  55.0  53.0  55.0  62.0  58.0  54.0  46.0  33.0  36.0   

     ANN     Land Value   Longitude   Latitude  Current  
2   63.0  474416.339219 -116.327415  43.438301     94.6  
3   63.0  474416.339219 -116.289497  43.468910     46.0  
63  50.0  314400.663535  -73.865364  42.585896      4.3  
64  50.0  314400.663535  -73.826706  42.540352      4.1  
65  50.0  314400.663535  -73.830566  42.542080      2.2  


In [42]:
print("df_solar            ...", df_solar.shape[0])
print("Current is not null ...", df_solar.query('Current.notnull()').shape[0])
print("Current is     null ...", df_solar.query('Current.isnull()' ).shape[0])

df_solar            ... 5761
Current is not null ... 5712
Current is     null ... 49
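
To see where each row of an outer merge comes from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. The following is a minimal sketch with a few sample rows; `df_cities` and `df_plants` are names chosen here for illustration.

```python
import pandas as pd

df_cities = pd.DataFrame({
    'City-State':   ['BOISE,ID', 'FORT WAYNE,IN'],
    'County-State': ['ADA,ID', 'ALLEN,IN'],
})
df_plants = pd.DataFrame({
    'County-State': ['ADA,ID', 'ADA,ID', 'DALE,AL'],
    'Current':      [94.6, 46.0, 23.3],
})

merged = pd.merge(df_cities, df_plants, on='County-State',
                  how='outer', indicator=True)

# 'both': a city matched a plant; 'left_only': a city without a plant;
# 'right_only': a plant in a county with no listed city.
print(merged['_merge'].value_counts())
```

A city appears once per plant in its county, which is why a city such as Boise can occupy several rows of the merged DataFrame.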


In [43]:
df_solar.query('Longitude.isnull()')

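|     | SizeRank | Land Value | County-State | City-State | JAN | FEB | MAR | APR | MAY | JUN | ... | SEP | OCT | NOV | DEC | ANN | Longitude | Latitude | AC | DC | Current |
| --: | --: | --: | :-- | :-- | --: | --: | --: | --: | --: | --: | :-: | --: | --: | --: | --: | --: | --: | --: | --: | --: | --: |
| 90 | 71.0 | 227793.6 | ALLEN,IN | FORT WAYNE,IN | 50.0 | 55.0 | 57.0 | 63.0 | 69.0 | 74.0 | ... | 67.0 | 60.0 | 40.0 | 36.0 | 60.0 | NaN | NaN | NaN | NaN | NaN |
| 93 | 3161.0 | 191529.0 | ALPENA,MI | ALPENA,MI | 36.0 | 43.0 | 51.0 | 55.0 | 59.0 | 62.0 | ... | 52.0 | 41.0 | 28.0 | 25.0 | 48.0 | NaN | NaN | NaN | NaN | NaN |
| 95 | 111.0 | 392428.0 | ANCHORAGE BOROUGH,AK | ANCHORAGE,AK | 43.0 | 46.0 | 51.0 | 50.0 | 51.0 | 46.0 | ... | 41.0 | 36.0 | 35.0 | 33.0 | 43.0 | NaN | NaN | NaN | NaN | NaN |
| 174 | 35.0 | 206500.0 | BALTIMORE CITY,MD | BALTIMORE,MD | 50.0 | 58.0 | 55.0 | 57.0 | 55.0 | 60.0 | ... | 57.0 | 56.0 | 50.0 | 47.0 | 55.0 | NaN | NaN | NaN | NaN | NaN |
| 199 | 844.0 | 317148.2 | BANNOCK,ID | POCATELLO,ID | 38.0 | 55.0 | 63.0 | 64.0 | 68.0 | 74.0 | ... | 78.0 | 68.0 | 43.0 | 36.0 | 62.0 | NaN | NaN | NaN | NaN | NaN |
| 228 | 3961.0 | 195944.3 | BEADLE,SD | HURON,SD | 62.0 | 62.0 | 62.0 | 59.0 | 66.0 | 69.0 | ... | 69.0 | 59.0 | 51.0 | 51.0 | 63.0 | NaN | NaN | NaN | NaN | NaN |
| 468 | 406.0 | 272902.1 | BROWN,WI | GREEN BAY,WI | 51.0 | 56.0 | 55.0 | 56.0 | 64.0 | 66.0 | ... | 57.0 | 48.0 | 35.0 | 38.0 | 55.0 | NaN | NaN | NaN | NaN | NaN |
| 494 | 527.0 | 326276.7 | BURLEIGH,ND | BISMARCK,ND | 54.0 | 52.0 | 61.0 | 58.0 | 64.0 | 67.0 | ... | 67.0 | 53.0 | 42.0 | 45.0 | 59.0 | NaN | NaN | NaN | NaN | NaN |
| 543 | 155.0 | 146103.1 | CADDO PARISH,LA | SHREVEPORT,LA | 45.0 | 54.0 | 53.0 | 54.0 | 58.0 | 68.0 | ... | 64.0 | 64.0 | 57.0 | 51.0 | 58.0 | NaN | NaN | NaN | NaN | NaN |
| 547 | 285.0 | 211510.2 | CALCASIEU PARISH,LA | LAKE CHARLES,LA | 46.0 | 48.0 | 65.0 | 65.0 | 77.0 | 80.0 | ... | 83.0 | 77.0 | 55.0 | 49.0 | 67.0 | NaN | NaN | NaN | NaN | NaN |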


## 5 &emsp; Splitting the DataFrame for Training and Testing

In [44]:
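# Rows with a known Current value are split into training and test sets.
df_known = df_solar_ml.query('Current.notnull()')

X_train, X_test, y_train, y_test = train_test_split(
    df_known.drop(['Current'], axis=1),
    df_known['Current'],
    random_state=42
    )

# Rows with a missing Current value are kept aside as prediction targets.
X_predict = df_solar_ml.query('Current.isnull()').drop(['Current'], axis=1)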

In [45]:
X_predict

Unnamed: 0,JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP,OCT,NOV,DEC,ANN,Land Value,Longitude,Latitude
90,50.0,55.0,57.0,63.0,69.0,74.0,76.0,75.0,67.0,60.0,40.0,36.0,60.0,227793.6,,
93,36.0,43.0,51.0,55.0,59.0,62.0,66.0,60.0,52.0,41.0,28.0,25.0,48.0,191529.0,,
95,43.0,46.0,51.0,50.0,51.0,46.0,43.0,43.0,41.0,36.0,35.0,33.0,43.0,392428.0,,
174,50.0,58.0,55.0,57.0,55.0,60.0,63.0,61.0,57.0,56.0,50.0,47.0,55.0,206500.0,,
199,38.0,55.0,63.0,64.0,68.0,74.0,82.0,80.0,78.0,68.0,43.0,36.0,62.0,317148.2,,
228,62.0,62.0,62.0,59.0,66.0,69.0,76.0,74.0,69.0,59.0,51.0,51.0,63.0,195944.3,,
468,51.0,56.0,55.0,56.0,64.0,66.0,69.0,65.0,57.0,48.0,35.0,38.0,55.0,272902.1,,
494,54.0,52.0,61.0,58.0,64.0,67.0,75.0,72.0,67.0,53.0,42.0,45.0,59.0,326276.7,,
543,45.0,54.0,53.0,54.0,58.0,68.0,70.0,69.0,64.0,64.0,57.0,51.0,58.0,146103.1,,
547,46.0,48.0,65.0,65.0,77.0,80.0,81.0,83.0,83.0,77.0,55.0,49.0,67.0,211510.2,,
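
Every row of `X_predict` displayed above has no `Longitude` or `Latitude`. Several of the scikit-learn estimators used in the next section (for example `LinearRegression`, `KNeighborsRegressor`, and `SVR`) raise an error on NaN input, so it may help to count the missing values before calling `predict`. The following is a minimal sketch of such a check. `X_predict_demo` is a hypothetical stand-in built from two of the rows shown above, not the real `X_predict`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X_predict: two rows and four columns
# copied from the table above (indexes 90 and 93).
X_predict_demo = pd.DataFrame(
    {
        'ANN':        [60.0, 48.0],
        'Land Value': [227793.6, 191529.0],
        'Longitude':  [np.nan, np.nan],
        'Latitude':   [np.nan, np.nan],
    },
    index=[90, 93],
)

# Count missing values per column and keep only columns that have any.
missing_per_column = X_predict_demo.isna().sum()
print(missing_per_column[missing_per_column > 0])

# Flag the rows that contain at least one missing feature.
rows_with_nan = X_predict_demo.isna().any(axis=1)
print("Rows with at least one NaN:", int(rows_with_nan.sum()))
```

Running the same two checks against the real `X_predict` would show whether the coordinates need to be filled in or dropped from the feature set before prediction.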


## 6 &emsp; Predicting Current by Using Regression Models

In [46]:
from sklearn.metrics import mean_squared_error

from sklearn.pipeline      import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model  import LinearRegression, Ridge
from sklearn.neighbors     import KNeighborsRegressor
from sklearn.tree          import DecisionTreeRegressor
from sklearn.svm           import SVR

from sklearn.ensemble import VotingRegressor

from sklearn.inspection import permutation_importance

## Try-it 20_1

In [47]:
reg_linear = LinearRegression()
reg_tree   = DecisionTreeRegressor(random_state=42)
reg_ridge  = Ridge()

# Create pipelines for each regressor
pipelines = {
    'LinearRegression()':      Pipeline([
        ('scaler',    StandardScaler()),
        ('regressor', reg_linear)
        ]),
    'KNeighborsRegressor()':   Pipeline([
        ('scaler',    StandardScaler()),
        ('regressor', KNeighborsRegressor())
        ]),
    'DecisionTreeRegressor()': Pipeline([
        ('regressor', reg_tree)
        ]),
    'Ridge()':        Pipeline([
        ('scaler',    StandardScaler()),
        ('regressor', reg_ridge)
        ]),
    'SVR()':                   Pipeline([
        ('scaler',    StandardScaler()),
        ('regressor', SVR())
        ])
}

# Define the Voting Regressor
voting_reg = VotingRegressor(estimators=[
    ('LinearRegression()',      pipelines['LinearRegression()']),
    ('KNeighborsRegressor()',   pipelines['KNeighborsRegressor()']),
    ('DecisionTreeRegressor()', pipelines['DecisionTreeRegressor()']),
    ('Ridge()',                 pipelines['Ridge()']),
    ('SVR()',                   pipelines['SVR()'])
])

# Function to evaluate models
def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    return mse

# Evaluate individual models without grid search
results_before_gs = {}
for name, pipeline in pipelines.items():
    pipeline.fit(X_train, y_train)
    mse = evaluate_model(pipeline, X_test, y_test)
    results_before_gs[name] = mse

# Evaluate Voting Regressor without grid search
voting_reg.fit(X_train, y_train)
voting_mse_before_gs = evaluate_model(voting_reg, X_test, y_test)
results_before_gs['VotingRegressor()'] = voting_mse_before_gs

# Print results before grid search
print("Results Before Grid Search:")
for name, mse in results_before_gs.items():
    print(f'{name}: MSE = {mse}')

Results Before Grid Search:
LinearRegression(): MSE = 14406.889378472753
KNeighborsRegressor(): MSE = 15206.99749304813
DecisionTreeRegressor(): MSE = 18635.329679144386
Ridge(): MSE = 14510.788281389701
SVR(): MSE = 17127.47048708527
VotingRegressor(): MSE = 13356.271653484504


In [48]:
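# Despite the X_ prefix, this array holds the predicted Current values.
# It appears as the unnamed column "0" next to the actual Current below.
X_pred_voting_reg = voting_reg.predict(X_test)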
pd.concat([
    X_test.reset_index(drop=True),
    y_test.reset_index(drop=True),
    pd.DataFrame(X_pred_voting_reg).reset_index(drop=True)
    ], axis=1)

Unnamed: 0,JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP,OCT,NOV,DEC,ANN,Land Value,Longitude,Latitude,Current,0
0,80.0,82.0,86.0,92.0,94.0,93.0,80.0,84.0,87.0,88.0,83.0,79.0,85.0,3.332105e+05,-110.973900,32.376358,9.6,63.420445
1,69.0,70.0,70.0,78.0,63.0,63.0,82.0,85.0,74.0,74.0,70.0,71.0,72.0,9.642500e+05,-118.128540,34.708324,3.2,16.819831
2,31.0,40.0,44.0,52.0,57.0,64.0,66.0,61.0,59.0,48.0,30.0,22.0,47.0,1.227581e+05,-81.751434,41.443275,7.7,18.379031
3,59.0,61.0,56.0,59.0,58.0,61.0,66.0,63.0,59.0,57.0,47.0,50.0,58.0,2.599504e+05,-72.650566,42.009109,4.9,20.364379
4,59.0,62.0,68.0,63.0,66.0,67.0,71.0,73.0,75.0,67.0,60.0,56.0,65.0,1.178752e+06,-158.063095,21.374256,25.0,28.471558
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
182,44.0,60.0,74.0,83.0,91.0,93.0,96.0,95.0,93.0,86.0,59.0,44.0,76.0,3.814220e+05,-120.343079,36.424538,46.3,123.946862
183,80.0,82.0,86.0,92.0,94.0,93.0,80.0,84.0,87.0,88.0,83.0,79.0,85.0,3.332105e+05,-110.891869,32.034618,265.0,62.011938
184,69.0,70.0,70.0,78.0,63.0,63.0,82.0,85.0,74.0,74.0,70.0,71.0,72.0,9.642500e+05,-118.191078,34.779060,6.9,32.564440
185,34.0,39.0,46.0,53.0,53.0,54.0,60.0,57.0,51.0,40.0,25.0,23.0,44.0,2.021899e+05,-75.958321,43.022732,12.4,12.848092


In [49]:
# Define parameter grids for Grid Search
param_grids = {
    'LinearRegression()':      {},
    'KNeighborsRegressor()':   {'regressor__n_neighbors': [3, 5, 7]},
    'DecisionTreeRegressor()': {'regressor__max_depth':   [3, 5, 7]},
    'Ridge()':                 {'regressor__alpha':       [0.1, 1.0, 10.0]},
    'SVR()':                   {
                                'regressor__C':           [0.1, 1.0, 10.0],
                                'regressor__gamma':       ['scale', 'auto']
        }
}

# Perform Grid Search and evaluate models
results_after_gs = {}
best_pipelines   = {}

for name, pipeline in pipelines.items():
    grid_search = GridSearchCV(
        pipeline, param_grids[name], cv=10, scoring='neg_mean_squared_error'
        )
    grid_search.fit(X_train, y_train)
    best_pipeline = grid_search.best_estimator_
    best_pipelines[name] = best_pipeline
    mse = evaluate_model(best_pipeline, X_test, y_test)
    results_after_gs[name] = mse

# Define the optimized Voting Regressor
optimized_voting_reg = VotingRegressor(estimators=[
    ('LinearRegression()',      best_pipelines['LinearRegression()']),
    ('KNeighborsRegressor()',   best_pipelines['KNeighborsRegressor()']),
    ('DecisionTreeRegressor()', best_pipelines['DecisionTreeRegressor()']),
    ('Ridge()',                 best_pipelines['Ridge()']),
    ('SVR()',                   best_pipelines['SVR()'])
])

# Fit the optimized Voting Regressor
optimized_voting_reg.fit(X_train, y_train)
voting_mse_after_gs = evaluate_model(optimized_voting_reg, X_test, y_test)
results_after_gs['VotingRegressor()'] = voting_mse_after_gs

# Print results after grid search
print("\nResults After Grid Search:")
for name, mse in results_after_gs.items():
    print(f'{name}: MSE = {mse}')


Results After Grid Search:
LinearRegression(): MSE = 14406.889378472753
KNeighborsRegressor(): MSE = 15557.775638740346
DecisionTreeRegressor(): MSE = 15861.672591392411
Ridge(): MSE = 14709.482916063604
SVR(): MSE = 16587.870025332246
VotingRegressor(): MSE = 13912.33523292811
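
The two result listings are easier to compare side by side. The sketch below copies the test-set MSE values printed above into a single DataFrame and adds the RMSE, which is in the same units as `Current`. The column names (`mse_before`, `mse_after`, and so on) are illustrative choices, not names used elsewhere in this notebook.

```python
import pandas as pd

# Test-set MSE values copied from the printed results above.
mse_before = {
    'LinearRegression()':      14406.889378472753,
    'KNeighborsRegressor()':   15206.99749304813,
    'DecisionTreeRegressor()': 18635.329679144386,
    'Ridge()':                 14510.788281389701,
    'SVR()':                   17127.47048708527,
    'VotingRegressor()':       13356.271653484504,
}
mse_after = {
    'LinearRegression()':      14406.889378472753,
    'KNeighborsRegressor()':   15557.775638740346,
    'DecisionTreeRegressor()': 15861.672591392411,
    'Ridge()':                 14709.482916063604,
    'SVR()':                   16587.870025332246,
    'VotingRegressor()':       13912.33523292811,
}

# One row per model, one column per stage.
comparison = pd.DataFrame({'mse_before': mse_before, 'mse_after': mse_after})

# RMSE is the square root of MSE, so it shares the units of the target.
comparison['rmse_before'] = comparison['mse_before'] ** 0.5
comparison['rmse_after']  = comparison['mse_after'] ** 0.5

# A negative change means the grid-searched model did better on the test set.
comparison['mse_change'] = comparison['mse_after'] - comparison['mse_before']

print(comparison.sort_values('mse_after'))
```

On these numbers, only `DecisionTreeRegressor()` and `SVR()` have a lower test-set MSE after the grid search. `KNeighborsRegressor()`, `Ridge()`, and `VotingRegressor()` score slightly worse, which can happen because the grid search selects parameters by cross-validation on the training set rather than by test-set error.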


---
Coefficient analysis

In [50]:
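print("Coefficients from Linear Regression:")
# Note: reg_linear is the same object used inside pipelines['LinearRegression()'].
# Refitting it here on the unscaled X_train overwrites that fitted step, and the
# coefficients below are per raw unit of each feature, not per standard deviation.
reg_linear.fit(X_train, y_train)
coefficients_linear = reg_linear.coef_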
coefficients_linear_df = pd.DataFrame({
    'feature': X_train.columns, 'coefficient': coefficients_linear
    })
coefficients_linear_df = coefficients_linear_df\
    .sort_values(by='coefficient', ascending=False)
print(coefficients_linear_df)

Coefficients from Linear Regression:
       feature  coefficient
8          SEP     9.532927
3          APR     7.685242
5          JUN     4.325897
6          JUL     3.409039
2          MAR     3.015289
0          JAN     2.681841
11         DEC     2.632793
1          FEB     2.386688
10         NOV     0.879369
13  Land Value    -0.000135
14   Longitude    -0.774182
9          OCT    -1.198499
15    Latitude    -1.247494
7          AUG    -3.539380
4          MAY    -5.272066
12         ANN   -24.021813
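
Because the coefficients above are fitted on unscaled features, their magnitudes depend on each feature's units. `Land Value`, which is measured in hundreds of thousands, gets a coefficient near zero even if it matters. One way to make coefficients comparable is to read them from a pipeline that standardizes the features first. The sketch below illustrates this on synthetic data. `X_demo`, `y_demo`, and the generating weights are assumptions made up for the example, not values from this project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): two features on very different scales.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    'ANN':        rng.uniform(40, 85, 200),
    'Land Value': rng.uniform(1e5, 1e6, 200),
})
y_demo = (
    2.0 * X_demo['ANN']
    + 1e-4 * X_demo['Land Value']
    + rng.normal(0, 1, 200)
)

# Fit once on raw features and once inside a scaling pipeline.
raw_model = LinearRegression().fit(X_demo, y_demo)
scaled_pipeline = Pipeline([
    ('scaler',    StandardScaler()),
    ('regressor', LinearRegression()),
]).fit(X_demo, y_demo)

coef_df = pd.DataFrame({
    'feature':                  X_demo.columns,
    'raw_coefficient':          raw_model.coef_,
    'standardized_coefficient': scaled_pipeline.named_steps['regressor'].coef_,
})
print(coef_df)
```

In this toy setup the raw coefficient of `Land Value` is tiny while its standardized coefficient is of the same order as that of `ANN`, since each standardized coefficient equals the raw one multiplied by the feature's standard deviation.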


In [51]:
print("Permutation Importance from KNeighborsRegressor():")
result_knn = permutation_importance(
    best_pipelines['KNeighborsRegressor()'], 
    X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1
    )

perm_importances_knn = pd.DataFrame({
    'feature':    X_test.columns,
    'importance': result_knn.importances_mean
    })

perm_importances_knn = perm_importances_knn\
    .sort_values(by='importance', ascending=False)

print(perm_importances_knn)

Permutation Importance from KNeighborsRegressor():
       feature  importance
15    Latitude    0.939331
14   Longitude    0.177160
0          JAN    0.149141
5          JUN    0.146407
4          MAY    0.145769
1          FEB    0.136081
3          APR    0.092702
6          JUL    0.068058
11         DEC    0.059843
8          SEP    0.056560
7          AUG    0.055536
2          MAR    0.036034
10         NOV    0.025436
13  Land Value   -0.015548
12         ANN   -0.015816
9          OCT   -0.019197


In [52]:
print("Feature Importance from Decision Tree:")
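# reg_tree is refit with its default settings (no max_depth limit), so these
# importances describe the untuned tree, not the one chosen by the grid search.
reg_tree.fit(X_train, y_train)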

importances_tree = reg_tree.feature_importances_

feature_importance_tree_df = pd.DataFrame({
    'feature': X_train.columns,
    'importance': importances_tree
    })

feature_importance_tree_df = feature_importance_tree_df\
    .sort_values(by='importance', ascending=False)
print(feature_importance_tree_df)

Feature Importance from Decision Tree:
       feature  importance
15    Latitude    0.377584
14   Longitude    0.361791
0          JAN    0.142325
10         NOV    0.073265
2          MAR    0.017210
13  Land Value    0.010684
4          MAY    0.004787
11         DEC    0.004616
7          AUG    0.003745
6          JUL    0.002281
8          SEP    0.000954
9          OCT    0.000409
12         ANN    0.000267
3          APR    0.000041
1          FEB    0.000023
5          JUN    0.000018


In [53]:
print("Coefficients from Ridge Regression:")

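# As with reg_linear, this refits the shared reg_ridge object on unscaled
# features, so the coefficients are per raw unit of each feature.
reg_ridge.fit(X_train, y_train)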
coefficients_ridge = reg_ridge.coef_
coefficients_ridge_df = pd.DataFrame({
    'feature': X_train.columns,
    'coefficient': coefficients_ridge
    })
coefficients_ridge_df = coefficients_ridge_df\
    .sort_values(by='coefficient', ascending=False)

print(coefficients_ridge_df)

Coefficients from Ridge Regression:
       feature  coefficient
8          SEP     9.448548
3          APR     7.599491
5          JUN     4.269770
6          JUL     3.301432
2          MAR     2.943856
0          JAN     2.635131
11         DEC     2.549575
1          FEB     2.285929
10         NOV     0.776409
13  Land Value    -0.000135
14   Longitude    -0.773119
9          OCT    -1.261998
15    Latitude    -1.265973
7          AUG    -3.582124
4          MAY    -5.367809
12         ANN   -23.069603


In [54]:
print("Permutation Importance from SVR():")

result_svr = permutation_importance(
    best_pipelines['SVR()'],
    X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1
    )

perm_importances_svr = pd.DataFrame({
    'feature':    X_test.columns,
    'importance': result_svr.importances_mean
    })

perm_importances_svr = perm_importances_svr\
    .sort_values(by='importance', ascending=False)

print(perm_importances_svr)

Permutation Importance from SVR():
       feature  importance
1          FEB    0.013704
0          JAN    0.011926
11         DEC    0.011366
6          JUL    0.009464
2          MAR    0.008582
12         ANN    0.007932
8          SEP    0.006863
3          APR    0.006318
7          AUG    0.004339
5          JUN    0.004173
10         NOV    0.003532
9          OCT    0.003322
14   Longitude    0.000498
13  Land Value   -0.000317
15    Latitude   -0.001296
4          MAY   -0.001689
