## House Plant Analysis 

This notebook covers the initial exploratory data analysis (step 1) of the generated houseplant database. Feature engineering (step 2) is then performed with two goals in mind. The first is dimensionality reduction (step 3) as a way to visualise similar plants, with the ultimate aim of producing 2D scatter plots. The second is to build a content-based recommender engine from this dataset.

#### Step 0: Setup

In [1]:
import pandas as pd
import sqlite3

DATABASE_LOC = r"C:\Users\Rory Crean\Dropbox (lkgroup)\Backup_HardDrive\Postdoc\PyForFun\House_Plant_Recommender\Database\house_plants.db"

In [4]:
conn = sqlite3.connect(DATABASE_LOC)
plant_df = pd.read_sql_query("SELECT * FROM plant_raw_data", conn)
conn.close()  # close the connection itself; no separate cursor is needed
plant_df.head()

Unnamed: 0,Plant_Name,Common_Names,Plant_Type,Family,Zones,Native_Range,Heights,Spreads,Bloom_Times,Bloom_Description,sunlight,Watering,Maintenance,Flowers,Leafs,Fruits
0,Aechmea,"urn plant,silver vase plant",Epiphyte,Bromeliaceae,10 to 11,Brazil,1.00 to 3.00 feet,1.00 to 2.00 feet,Seasonal bloomer,Violet to red with pink bracts,Part shade,Medium,Medium,Showy,Evergreen,
1,Ardisia crenata,"hen's eyes,coralberry,spiceberry,scratchthroat...",Broadleaf evergreen,Primulaceae,8 to 10,Japan to Northern India,4.00 to 5.00 feet,4.00 to 5.00 feet,May to June,Pinkish-white,Part shade to full shade,Medium,Medium,Showy,Evergreen,Showy
2,Euphorbia milii,"christplant,Christ plant,Christ thorn,crown of...",Broadleaf evergreen,Euphorbiaceae,9 to 11,Madagascar,3.00 to 6.00 feet,1.50 to 3.00 feet,Seasonal bloomer,Green subtended by red or yellow bracts,Full sun,Dry to medium,Medium,Showy,Evergreen,
3,Ficus elastica,"Indian rubberplant,India rubber plant,rubber p...",Broadleaf evergreen,Moraceae,10 to 12,Southeastern Asia,50.00 to 100.00 feet,50.00 to 100.00 feet,Rarely flowers indoors,,Part shade,Medium,Low,Insignificant,Evergreen,
4,Woodsia obtusa,"blunt-lobed woodsia,common woodsia,large woodsia",Fern,Woodsiaceae,4 to 8,North America,1.00 to 1.50 feet,2.00 to 2.50 feet,Non-flowering,Non-flowering,Part shade to full shade,Medium,Medium,,,


In [5]:
len(plant_df)

92

#### Part 1: Data exploration


In [8]:
plant_df.columns

Index(['Plant_Name', 'Common_Names', 'Plant_Type', 'Family', 'Zones',
       'Native_Range', 'Heights', 'Spreads', 'Bloom_Times',
       'Bloom_Description', 'sunlight', 'Watering', 'Maintenance', 'Flowers',
       'Leafs', 'Fruits'],
      dtype='object')

It is fair to say the column "Common_Names" won't be useful as a feature and can be dropped straight away. All other columns have some potential.

In [9]:
plant_df = plant_df.drop(["Common_Names"], axis=1)

In [10]:
plant_df.head()

Unnamed: 0,Plant_Name,Plant_Type,Family,Zones,Native_Range,Heights,Spreads,Bloom_Times,Bloom_Description,sunlight,Watering,Maintenance,Flowers,Leafs,Fruits
0,Aechmea,Epiphyte,Bromeliaceae,10 to 11,Brazil,1.00 to 3.00 feet,1.00 to 2.00 feet,Seasonal bloomer,Violet to red with pink bracts,Part shade,Medium,Medium,Showy,Evergreen,
1,Ardisia crenata,Broadleaf evergreen,Primulaceae,8 to 10,Japan to Northern India,4.00 to 5.00 feet,4.00 to 5.00 feet,May to June,Pinkish-white,Part shade to full shade,Medium,Medium,Showy,Evergreen,Showy
2,Euphorbia milii,Broadleaf evergreen,Euphorbiaceae,9 to 11,Madagascar,3.00 to 6.00 feet,1.50 to 3.00 feet,Seasonal bloomer,Green subtended by red or yellow bracts,Full sun,Dry to medium,Medium,Showy,Evergreen,
3,Ficus elastica,Broadleaf evergreen,Moraceae,10 to 12,Southeastern Asia,50.00 to 100.00 feet,50.00 to 100.00 feet,Rarely flowers indoors,,Part shade,Medium,Low,Insignificant,Evergreen,
4,Woodsia obtusa,Fern,Woodsiaceae,4 to 8,North America,1.00 to 1.50 feet,2.00 to 2.50 feet,Non-flowering,Non-flowering,Part shade to full shade,Medium,Medium,,,


##### Part 1.1 - Columns: "Plant_Type" and "Family"

In [20]:
plant_df["Plant_Type"].value_counts()

Herbaceous perennial    32
Broadleaf evergreen     20
Vine                     9
Bulb                     8
Fern                     6
Palm or Cycad            4
Orchid                   3
Deciduous shrub          3
Epiphyte                 3
Needled evergreen        2
Annual                   1
Tree                     1
Name: Plant_Type, dtype: int64

Based on the value counts above, I can keep the "Plant_Type" categories with a large number of observations and merge those with few observations into a single class named "Other". This column would need one-hot encoding.
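The grouping idea above could be sketched roughly as follows. This is a minimal illustration on a toy series that reuses category names from the output above; the `MIN_COUNT` cut-off and the `type` prefix are assumptions for illustration, not values decided in this notebook.

```python
import pandas as pd

# Toy sample reusing category names from the value_counts output above
# (the counts here are illustrative only, not the real ones).
plant_types = pd.Series(
    ["Herbaceous perennial"] * 4
    + ["Broadleaf evergreen"] * 3
    + ["Vine"] * 2
    + ["Tree", "Annual"],
    name="Plant_Type",
)

MIN_COUNT = 2  # hypothetical cut-off; would need tuning on the full data

# Find categories with fewer than MIN_COUNT observations
counts = plant_types.value_counts()
rare = counts[counts < MIN_COUNT].index

# Keep frequent categories, replace rare ones with "Other"
grouped = plant_types.where(~plant_types.isin(rare), "Other")

# One-hot encode the grouped column (one indicator column per category)
one_hot = pd.get_dummies(grouped, prefix="type")
print(one_hot.sum())
```

With only 92 plants, the cut-off directly controls how many one-hot columns are created, so it is worth trying a few values.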

In [19]:
plant_df["Family"].value_counts()

Araceae            12
Asparagaceae        7
Euphorbiaceae       4
Moraceae            4
Piperaceae          4
Crassulaceae        4
Araliaceae          4
Arecaceae           3
Rubiaceae           3
Primulaceae         3
Aspleniaceae        3
Orchidaceae         3
Commelinaceae       3
Asphodelaceae       2
Bromeliaceae        2
Malvaceae           2
Apocynaceae         2
Solanaceae          2
Woodsiaceae         1
Araucariaceae       1
Lamiaceae           1
Amaryllidaceae      1
Marantaceae         1
Campanulaceae       1
Pteridaceae         1
Basellaceae         1
Vitaceae            1
Saxifragaceae       1
Lythraceae          1
Polypodiaceae       1
Ranunculaceae       1
Begoniaceae         1
Urticaceae          1
Melastomataceae     1
Acanthaceae         1
Iridaceae           1
Zingiberaceae       1
Cactaceae           1
Cycadaceae          1
Pinaceae            1
Asteraceae          1
Strelitziaceae      1
Theaceae            1
Name: Family, dtype: int64

The column "Family" contains an even larger number of categories than "Plant_Type". If I decide to use it, I would need to merge many of them into a group named "Other". This may not be ideal, as those plants would then be treated as similar even if they are not...

I also need to consider that the two columns may be correlated, and I don't have that many samples to train with...

In [18]:
plant_df[["Plant_Type", "Family"]].value_counts(sort=False)

Plant_Type            Family         
Annual                Asteraceae         1
Broadleaf evergreen   Apocynaceae        1
                      Araliaceae         3
                      Asparagaceae       2
                      Euphorbiaceae      2
                      Malvaceae          2
                      Melastomataceae    1
                      Moraceae           4
                      Primulaceae        1
                      Rubiaceae          2
                      Solanaceae         1
                      Theaceae           1
Bulb                  Amaryllidaceae     1
                      Araceae            4
                      Asparagaceae       1
                      Iridaceae          1
                      Primulaceae        1
Deciduous shrub       Crassulaceae       1
                      Euphorbiaceae      1
                      Lythraceae         1
Epiphyte              Bromeliaceae       2
                      Cactaceae          1
Fern            

Looking at the 2D value_counts results above, frequent "Family" categories appear under several "Plant_Type" categories.
E.g. "Araceae" is observed in "Bulb", "Herbaceous perennial", and "Vine".

As "Plant_Type" is more intuitive, has fewer categories, and is more indicative of how the plant actually looks, if I take either of these columns forward, it will be "Plant_Type".

##### Part 1.2 - Columns: "Zones"
Zones describes the cold hardiness of a plant, given as a range of hardiness zones (e.g. "10 to 12").

In [23]:
# Could I turn this into a regression, max cold hardiness?
# Could even convert to degrees etc...

In [21]:
plant_df["Zones"].value_counts()

10 to 12    28
11 to 12    12
9 to 11     11
10 to 11    10
8 to 10      7
4 to 8       5
9 to 10      4
8 to 11      2
6 to 9       2
9 to 12      2
3 to 8       2
5 to 8       1
7 to 9       1
4 to 9       1
5 to 9       1
2 to 11      1
None         1
8 to 12      1
Name: Zones, dtype: int64
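One possible way to make "Zones" numeric is to split each range into a minimum and maximum zone. The sketch below uses a few values from the output above; the column names `zone_min` and `zone_max` are made up for illustration, and it assumes these are USDA-style hardiness zones, where a lower number means colder winters (so `zone_min` would be the "max cold hardiness").

```python
import pandas as pd

# A few values taken from the value_counts output above, including the "None" entry
zones = pd.Series(["10 to 12", "4 to 8", "None", "2 to 11"], name="Zones")

# Pull out the two integers; rows that do not match (e.g. "None") become NaN
pattern = r"^(?P<zone_min>\d+) to (?P<zone_max>\d+)$"
parsed = zones.str.extract(pattern).apply(pd.to_numeric)

print(parsed)
```

The unmatched row stays as NaN, which keeps the missing entry visible rather than silently turning it into a number.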

##### Part 1.3 - Columns: "Native_Range"

In [27]:
plant_df["Native_Range"].value_counts()

None                                                13
Brazil                                               4
Madagascar                                           3
Mediterranean                                        3
Mexico to northern South America and West Indies     2
                                                    ..
Arabian penninsula, eastern Africa                   1
Western Asia, Europe                                 1
Tropical Africa and tropical Asia                    1
Nigeria                                              1
Mexico, Central America                              1
Name: Native_Range, Length: 68, dtype: int64

I am sceptical of the usefulness of this column. The values would need to be generalised a lot, and even then a single country or part of a continent can have very different climates in different regions. I think the more direct information from columns like "sunlight" or "Zones" is better suited. There is also a lot of missing data. I will very probably not use this column as a feature.

##### Part 1.4 - Columns: "Heights" and "Spreads"

In [28]:
plant_df["Heights"].value_counts()

1.00 to 1.50 feet        7
0.50 to 0.75 feet        6
1.00 to 2.00 feet        6
0.50 to 1.00 feet        5
0.25 to 0.50 feet        5
3.00 to 6.00 feet        4
6.00 to 15.00 feet       3
2.00 to 3.00 feet        3
2.00 to 4.00 feet        3
3.00 to 10.00 feet       3
1.00 to 2.50 feet        3
0.75 to 1.00 feet        2
2.00 to 6.00 feet        2
20.00 to 30.00 feet      2
5.00 to 6.00 feet        2
60.00 to 100.00 feet     2
3.00 to 5.00 feet        2
40.00 to 50.00 feet      1
6.00 to 16.00 feet       1
100.00 to 200.00 feet    1
1.50 to 2.00 feet        1
8.00 to 10.00 feet       1
0.50 to 1.50 feet        1
3.50 to 4.00 feet        1
10.00 to 25.00 feet      1
1.00 to 3.00 feet        1
3.00 to 12.00 feet       1
6.00 to 10.00 feet       1
3.00 to 9.00 feet        1
12.00 to 30.00 feet      1
0.50 to 10.00 feet       1
3.00 to 8.00 feet        1
4.00 to 10.00 feet       1
40.00 to 60.00 feet      1
5.00 to 10.00 feet       1
1.00 to 4.00 feet        1
6.00 to 20.00 feet       1
0

Could turn this into a single column with max height, average height, or height range.
Same thought process for spreads. Can certainly make one or more columns of continuous data from this.
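As a rough sketch of the options listed above, the range strings can be split into numeric minimum, maximum, mean, and range columns. The helper name `split_range` and the output column names are hypothetical; the sample strings follow the format shown in the output above.

```python
import pandas as pd

# Sample strings in the same "<min> to <max> feet" format as the output above
heights = pd.Series(
    ["1.00 to 1.50 feet", "50.00 to 100.00 feet", "0.50 to 10.00 feet"],
    name="Heights",
)

def split_range(col: pd.Series, prefix: str) -> pd.DataFrame:
    """Turn '<min> to <max> feet' strings into numeric columns (in feet)."""
    pattern = r"^(?P<lo>\d+(?:\.\d+)?) to (?P<hi>\d+(?:\.\d+)?) feet$"
    out = col.str.extract(pattern).apply(pd.to_numeric)
    out.columns = [f"{prefix}_min", f"{prefix}_max"]
    # Mean is taken over the min and max columns only (the only two present here)
    out[f"{prefix}_mean"] = out.mean(axis=1)
    out[f"{prefix}_range"] = out[f"{prefix}_max"] - out[f"{prefix}_min"]
    return out

height_features = split_range(heights, "height")
print(height_features)
```

The same helper could be applied to "Spreads" with a different prefix. Given how skewed the heights are (from under a foot to hundreds of feet), a log transform may be worth considering before modelling.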

##### Part 1.5 - Columns: "Bloom_Times" and "Bloom_Description"

##### Part 1.6 - Columns: "sunlight", "Watering", and "Maintenance"

In [29]:
plant_df["sunlight"].value_counts()

Part shade                  32
Part shade to full shade    22
Full sun to part shade      20
Full sun                    17
None                         1
Name: sunlight, dtype: int64

In [30]:
plant_df["Watering"].value_counts()

Medium           73
Dry to medium    13
Medium to wet     3
Dry               2
None              1
Name: Watering, dtype: int64

In [31]:
plant_df["Maintenance"].value_counts()

Low       51
Medium    36
High       4
None       1
Name: Maintenance, dtype: int64

In [34]:
# identify the plant with missing data (missing values are stored as the string "None")
plant_df.loc[plant_df["Maintenance"] == "None"]

Unnamed: 0,Plant_Name,Plant_Type,Family,Zones,Native_Range,Heights,Spreads,Bloom_Times,Bloom_Description,sunlight,Watering,Maintenance,Flowers,Leafs,Fruits
23,Asplenium antiquum,Fern,Aspleniaceae,10 to 11,,1.50 to 2.00 feet,3.00 to 4.00 feet,Non-flowering,,,,,,,


I want all three of these features, and all three are easy to ordinal encode, so there is no issue with creating many columns (unlike with one-hot encoding...).

One plant has missing data (see the cell above). [From checking the plant database](https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=241952), though, the info is available (just in paragraph form), so I can manually add it to these columns.

##### Part 1.7 - Columns: "Flowers", "Leafs", and "Fruits"

In [35]:
plant_df["Flowers"].value_counts()

Showy                        42
Insignificant                22
None                         16
Showy, Fragrant               8
Showy, Good Cut               2
Fragrant, Insignificant       1
Showy, Fragrant, Good Cut     1
Name: Flowers, dtype: int64

In [36]:
plant_df["Leafs"].value_counts()
# Want to check whether the "None" plants really have no leaves, or whether this is just a lack of info

Evergreen              49
None                   17
Colorful, Evergreen    14
Colorful               11
Fragrant, Evergreen     1
Name: Leafs, dtype: int64

In [37]:
plant_df["Fruits"].value_counts() 

None             71
Showy            14
Showy, Edible     7
Name: Fruits, dtype: int64

- For "Flowers": could just be a binary "flowers" or "does not flower" ("Insignificant" counts as "does not flower").
- For "Fruits": same as above, a binary "fruits" or "does not fruit".
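The two binary flags above might look something like this. It is only a sketch on a toy frame built from labels seen in the outputs above; the column names `has_flowers` and `has_fruits` are hypothetical, and treating any label containing "Insignificant" (such as "Fragrant, Insignificant") as "does not flower" is an assumption.

```python
import pandas as pd

# Toy rows using labels that appear in the value_counts outputs above
toy = pd.DataFrame({
    "Flowers": ["Showy", "Insignificant", "None", "Showy, Fragrant", "Fragrant, Insignificant"],
    "Fruits": ["None", "Showy", "None", "Showy, Edible", "None"],
})

# Treat real missing values the same as the "None" string, just in case
flowers = toy["Flowers"].fillna("None")
fruits = toy["Fruits"].fillna("None")

# Assumption: "None" or anything containing "Insignificant" counts as not flowering
toy["has_flowers"] = (~flowers.str.contains("None|Insignificant")).astype(int)

# Any fruit label other than "None" counts as fruiting
toy["has_fruits"] = (fruits != "None").astype(int)

print(toy)
```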

#### Part 2: Feature Engineering 
Based on the insights above, this section modifies the dataframe to build the features required for each plant.
Not all of the features made here will necessarily be used, depending on how things go.

Will want this for several features: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html
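As a rough idea of how `OrdinalEncoder` could be applied to "sunlight", "Watering", and "Maintenance", here is a sketch on a toy frame. The category orderings (lowest to highest) are my own assumption about what is sensible and have not been settled in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy rows using labels from the value_counts outputs in Part 1.6
toy = pd.DataFrame({
    "sunlight": ["Full sun", "Part shade", "Part shade to full shade", "Full sun to part shade"],
    "Watering": ["Dry", "Medium", "Medium to wet", "Dry to medium"],
    "Maintenance": ["Low", "Medium", "High", "Low"],
})

# Assumed orderings, from lowest to highest requirement
sunlight_order = ["Part shade to full shade", "Part shade", "Full sun to part shade", "Full sun"]
watering_order = ["Dry", "Dry to medium", "Medium", "Medium to wet"]
maintenance_order = ["Low", "Medium", "High"]

# One category list per column, in the same order as the columns of `toy`
encoder = OrdinalEncoder(categories=[sunlight_order, watering_order, maintenance_order])
encoded = pd.DataFrame(encoder.fit_transform(toy), columns=toy.columns)

print(encoded)
```

With explicit category lists, the encoder raises an error by default on labels it does not know (such as "None"), so the plant with missing data would need to be filled in first.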

#### Part 3.1: Modelling - Dimensionality reduction
This section will focus on using dimensionality reduction as a way to visualise the relationships between plants.
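No method has been chosen yet. As one possible starting point, PCA from scikit-learn can project a scaled feature matrix onto two components. The feature values below are entirely made up placeholders standing in for the engineered features, and PCA itself is an assumption here, not a decision made in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: one row per plant, one column per engineered
# feature (placeholder values only, not taken from the database)
features = np.array([
    [3.0, 0.0, 0.0, 1.25],
    [1.0, 2.0, 1.0, 4.50],
    [0.0, 3.0, 2.0, 0.75],
    [2.0, 1.0, 0.0, 75.0],
    [1.0, 2.0, 1.0, 2.00],
])

# Scale first so that large-valued features (e.g. height) do not dominate
scaled = StandardScaler().fit_transform(features)

# Project onto the first two principal components
pca = PCA(n_components=2)
coords = pca.fit_transform(scaled)

print(coords)
print(pca.explained_variance_ratio_)
```

The two columns of `coords` can then be used as the x and y values of a 2D scatter plot, with one point per plant.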

#### Part 3.2: Modelling - Recommender engine

This will be a content-based recommender engine, as no user data is available.
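A content-based recommender can be sketched as a nearest-neighbour lookup on the feature vectors. The example below uses cosine similarity, which is one common choice and an assumption here. The plant names, feature values, and the `recommend` helper are all hypothetical placeholders.

```python
import numpy as np

# Placeholder plants and feature vectors (made up for illustration only)
plant_names = ["plant_a", "plant_b", "plant_c", "plant_d"]
features = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 0.9],
    [0.0, 1.0, 0.0],
    [0.2, 1.0, 0.0],
])

def recommend(name: str, top_n: int = 2) -> list:
    """Return the top_n plants most similar to `name` by cosine similarity."""
    # Normalise each row to unit length so the dot product equals the cosine
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = unit @ unit.T
    idx = plant_names.index(name)
    # Sort by descending similarity and skip the query plant itself
    order = np.argsort(-sims[idx])
    return [plant_names[i] for i in order if i != idx][:top_n]

print(recommend("plant_a"))
```

Because only plant features are compared, no user ratings are needed, which matches the situation here.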