# Exploring and Cleaning the Datasets for the Color Palette Generator
In this notebook, I explore and prepare the datasets for the ML model of the color palette generator project. I start with the Palette-and-Text (PAT) dataset from the Text2Colors repository, then work with the dataset from the Keras-Colors repository, and finally merge the two datasets and prepare the result for my neural network. Links to both repositories are provided in the last section of this notebook.

### 1- Imports
All the packages needed for this notebook.

In [2]:
import numpy as np
import pandas as pd 

### 2- PAT Dataset
The PAT dataset originally came as pickle files, split into train and test sets. I loaded these pickle files and saved them as two CSV files, one for the words and one for the colors. I'll start with the words dataset.
#### - The Words Dataset
This dataset consists of 11 word columns. The entries vary from single words to 11-word sentences.

In [3]:
#Loading the dataset
words_df = pd.read_csv('words.csv')
words_df.head()

Unnamed: 0.1,Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,0,communism,,,,,,,,,,
1,1,dark,neon,,,,,,,,,
2,2,dont,talk,to,me,,,,,,,
3,3,good,night,princess,,,,,,,,
4,4,good,morning,princess,,,,,,,,


In [4]:
words_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10183 entries, 0 to 10182
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  10183 non-null  int64 
 1   0           10165 non-null  object
 2   1           6689 non-null   object
 3   2           2181 non-null   object
 4   3           676 non-null    object
 5   4           218 non-null    object
 6   5           80 non-null     object
 7   6           31 non-null     object
 8   7           13 non-null     object
 9   8           7 non-null      object
 10  9           3 non-null      object
 11  10          2 non-null      object
dtypes: int64(1), object(11)
memory usage: 954.8+ KB


In [6]:
words_df.shape

(10183, 12)

We have a total of 10183 phrases

In [7]:
# deleting the unnamed 0 column since it's a duplicate of the index
words_df.drop('Unnamed: 0', inplace=True, axis=1)
words_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,communism,,,,,,,,,,
1,dark,neon,,,,,,,,,
2,dont,talk,to,me,,,,,,,
3,good,night,princess,,,,,,,,
4,good,morning,princess,,,,,,,,
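The `Unnamed: 0` column appears because `to_csv` writes the index by default, and `read_csv` then reads it back as an ordinary column. Below is a minimal sketch of two ways to avoid it, using an in-memory CSV. The tiny `demo` frame and its values are made up for illustration and are not the real data.

```python
import io

import pandas as pd

# Toy stand-in for the real table (hypothetical values).
demo = pd.DataFrame({"0": ["communism", "dark"], "1": [None, "neon"]})

# Default to_csv writes the index as an unnamed first column.
csv_with_index = demo.to_csv()
reread = pd.read_csv(io.StringIO(csv_with_index))
print(reread.columns.tolist())

# Option 1: tell read_csv that the first column is the index.
fixed_on_read = pd.read_csv(io.StringIO(csv_with_index), index_col=0)

# Option 2: do not write the index in the first place.
fixed_on_write = pd.read_csv(io.StringIO(demo.to_csv(index=False)))
print(fixed_on_read.columns.tolist(), fixed_on_write.columns.tolist())
```

Either option makes the later `drop('Unnamed: 0', ...)` step unnecessary.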


In [8]:
#looking at null values
words_df.isnull().sum()

0        18
1      3494
2      8002
3      9507
4      9965
5     10103
6     10152
7     10170
8     10176
9     10180
10    10181
dtype: int64

Since the columns represent the words of a sentence, it is normal to have null values in columns 1 to 10. However, the null values in column 0 must be handled; I'll handle them after I merge the words with the colors.
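To make the distinction between "expected" and "problematic" nulls concrete, here is a small sketch that counts the words present in each row and separates rows with a missing first word into partly filled and completely empty ones. The frame `toy_words` is a hypothetical miniature of the words table, not the real data.

```python
import numpy as np
import pandas as pd

# Toy rows mimicking the words table (hypothetical values).
toy_words = pd.DataFrame(
    {
        "0": ["communism", "dark", np.nan, np.nan],
        "1": [np.nan, "neon", "shades", np.nan],
        "2": [np.nan, np.nan, "of", np.nan],
    }
)

# Number of words actually present in each phrase.
words_per_row = toy_words.notna().sum(axis=1)

# Rows whose first word is missing, split into "partly filled" and "empty".
missing_first = toy_words["0"].isna()
partly_filled = toy_words[missing_first & (words_per_row > 0)]
fully_empty = toy_words[missing_first & (words_per_row == 0)]

print(words_per_row.tolist())
print(partly_filled.index.tolist(), fully_empty.index.tolist())
```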

In [9]:
#renaming columns, otherwise things will get messy after merging
words_df.rename(columns = {'0':'word1', '1':'word2', '2':'word3', '3':'word4', '4':'word5', '5':'word6', '6':'word7', '7':'word8', '8':'word9', '9':'word10', '10':'word11'}, inplace = True)
words_df.head()

Unnamed: 0,word1,word2,word3,word4,word5,word6,word7,word8,word9,word10,word11
0,communism,,,,,,,,,,
1,dark,neon,,,,,,,,,
2,dont,talk,to,me,,,,,,,
3,good,night,princess,,,,,,,,
4,good,morning,princess,,,,,,,,


In [10]:
#checking duplicates
words_df.duplicated().sum()

1653

In [11]:
#looking at the duplicates
words_df[words_df.duplicated()]

Unnamed: 0,word1,word2,word3,word4,word5,word6,word7,word8,word9,word10,word11
77,fresh,sophisticate,,,,,,,,,
220,blue,,,,,,,,,,
290,pinks,,,,,,,,,,
313,cute,,,,,,,,,,
334,wizard,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...
10172,watermelon,,,,,,,,,,
10177,shades,of,red,,,,,,,,
10178,grace,,,,,,,,,,
10179,pastel,mint,,,,,,,,,


Let's investigate the duplicates further by looking at an example: the entries whose first word is blue.

In [9]:
words_df[words_df['word1']=='blue']

Unnamed: 0,word1,word2,word3,word4,word5,word6,word7,word8,word9,word10,word11
80,blue,,,,,,,,,,
184,blue,gray,,,,,,,,,
220,blue,,,,,,,,,,
316,blue,hues,,,,,,,,,
650,blue,skies,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...
9849,blue,and,orange,,,,,,,,
9962,blue,,,,,,,,,,
9975,blue,,,,,,,,,,
10068,blue,oyster,,,,,,,,,


There are indeed multiple duplicates for the word blue alone. However, I believe each of these entries has a different shade of blue associated with it, so I'll also postpone handling the duplicates until after I merge the words and colors datasets.
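The reasoning here is that a repeated phrase is only a true duplicate if its palette is repeated too. A minimal sketch of the difference, using two hypothetical aligned toy frames (`toy_words`, `toy_colors`) rather than the real data:

```python
import pandas as pd

# Hypothetical miniature of the two tables, aligned row by row.
toy_words = pd.DataFrame({"word1": ["blue", "blue", "zebra", "zebra"]})
toy_colors = pd.DataFrame(
    {"Color1": ["(0, 0, 255)", "(70, 130, 180)", "(0, 0, 0)", "(0, 0, 0)"]}
)

# Duplicates when only the words are compared.
word_only_dupes = int(toy_words.duplicated().sum())

# Duplicates when the word and its palette are compared together.
toy_joined = toy_words.join(toy_colors)
joined_dupes = int(toy_joined.duplicated().sum())

print(word_only_dupes, joined_dupes)
```

In this toy case the two "blue" rows stay distinct after joining, while the two "zebra" rows remain exact duplicates.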

#### - The Colors Dataset
The colors dataset consists of 5 colors for each entry, each color in (red, green, blue) format.

In [12]:
#Loading the dataset
colors_df = pd.read_csv('colors.csv')
colors_df.head()

Unnamed: 0.1,Unnamed: 0,0,1,2,3,4
0,0,"(206, 18, 18)","(193, 27, 27)","(201, 149, 149)","(191, 138, 138)","(165, 27, 27)"
1,1,"(116, 35, 35)","(117, 122, 37)","(35, 118, 41)","(44, 41, 114)","(115, 36, 116)"
2,2,"(192, 237, 251)","(192, 221, 235)","(192, 205, 219)","(192, 189, 203)","(192, 173, 187)"
3,3,"(49, 40, 66)","(65, 55, 84)","(87, 73, 105)","(120, 100, 131)","(156, 137, 161)"
4,4,"(246, 245, 230)","(244, 236, 217)","(244, 226, 204)","(240, 208, 186)","(234, 189, 169)"


In [13]:
colors_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10183 entries, 0 to 10182
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  10183 non-null  int64 
 1   0           10183 non-null  object
 2   1           10183 non-null  object
 3   2           10183 non-null  object
 4   3           10183 non-null  object
 5   4           10183 non-null  object
dtypes: int64(1), object(5)
memory usage: 477.5+ KB


In [14]:
colors_df.shape

(10183, 6)

We have a total of 10183 palettes

In [15]:
# deleting the unnamed 0 column since it's a duplicate of the index
colors_df.drop('Unnamed: 0', inplace=True, axis=1)
colors_df.head()

Unnamed: 0,0,1,2,3,4
0,"(206, 18, 18)","(193, 27, 27)","(201, 149, 149)","(191, 138, 138)","(165, 27, 27)"
1,"(116, 35, 35)","(117, 122, 37)","(35, 118, 41)","(44, 41, 114)","(115, 36, 116)"
2,"(192, 237, 251)","(192, 221, 235)","(192, 205, 219)","(192, 189, 203)","(192, 173, 187)"
3,"(49, 40, 66)","(65, 55, 84)","(87, 73, 105)","(120, 100, 131)","(156, 137, 161)"
4,"(246, 245, 230)","(244, 236, 217)","(244, 226, 204)","(240, 208, 186)","(234, 189, 169)"


In [16]:
#Renaming the columns
colors_df.rename(columns = {'0':'Color1', '1':'Color2', '2':'Color3', '3':'Color4', '4':'Color5'}, inplace = True)
colors_df.head()

Unnamed: 0,Color1,Color2,Color3,Color4,Color5
0,"(206, 18, 18)","(193, 27, 27)","(201, 149, 149)","(191, 138, 138)","(165, 27, 27)"
1,"(116, 35, 35)","(117, 122, 37)","(35, 118, 41)","(44, 41, 114)","(115, 36, 116)"
2,"(192, 237, 251)","(192, 221, 235)","(192, 205, 219)","(192, 189, 203)","(192, 173, 187)"
3,"(49, 40, 66)","(65, 55, 84)","(87, 73, 105)","(120, 100, 131)","(156, 137, 161)"
4,"(246, 245, 230)","(244, 236, 217)","(244, 226, 204)","(240, 208, 186)","(234, 189, 169)"


In [17]:
#checking duplicates
colors_df.duplicated().sum()

15

In [18]:
#Taking a closer look at the duplicates
colors_df[colors_df.duplicated()]

Unnamed: 0,Color1,Color2,Color3,Color4,Color5
966,"(238, 238, 238)","(221, 221, 221)","(204, 204, 204)","(187, 187, 187)","(170, 170, 170)"
1797,"(0, 0, 0)","(0, 0, 0)","(0, 0, 0)","(0, 0, 0)","(0, 0, 0)"
1930,"(238, 238, 238)","(221, 221, 221)","(204, 204, 204)","(187, 187, 187)","(170, 170, 170)"
2627,"(0, 0, 0)","(0, 0, 0)","(0, 0, 0)","(0, 0, 0)","(0, 0, 0)"
4414,"(0, 0, 0)","(0, 0, 0)","(0, 0, 0)","(0, 0, 0)","(0, 0, 0)"
6257,"(255, 255, 255)","(255, 255, 255)","(255, 255, 255)","(255, 255, 255)","(255, 255, 255)"
7047,"(0, 0, 0)","(0, 0, 0)","(0, 0, 0)","(0, 0, 0)","(0, 0, 0)"
7229,"(255, 255, 255)","(255, 255, 255)","(255, 255, 255)","(255, 255, 255)","(255, 255, 255)"
7357,"(255, 255, 255)","(255, 255, 255)","(255, 255, 255)","(255, 255, 255)","(255, 255, 255)"
7844,"(255, 255, 255)","(255, 255, 255)","(255, 255, 255)","(255, 255, 255)","(255, 255, 255)"


Some of the duplicated palettes are fully black, fully white, a mix of both, or an ordinary palette. This could be related to the words they are assigned to; for example, both "white" and "light" could be assigned to a fully white palette. Again, I won't make a decision on these duplicates until I merge the words with the colors. I'll save the indices of these duplicates and check them after merging.

In [27]:
dup_colors = colors_df[colors_df.duplicated()].index.tolist()
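Note that `duplicated()` uses `keep='first'` by default, so the saved indices cover only the second and later copies of each repeated palette. If the first occurrence should be inspected as well, `keep=False` flags every copy. A small sketch on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical palettes; rows 0 and 2 are identical.
toy_colors = pd.DataFrame(
    {
        "Color1": ["(0, 0, 0)", "(255, 255, 255)", "(0, 0, 0)", "(1, 2, 3)"],
        "Color2": ["(0, 0, 0)", "(255, 255, 255)", "(0, 0, 0)", "(4, 5, 6)"],
    }
)

# Default keep="first": only the second and later copies are flagged.
later_copies = toy_colors[toy_colors.duplicated()].index.tolist()

# keep=False flags every row that has at least one identical twin.
all_copies = toy_colors[toy_colors.duplicated(keep=False)].index.tolist()

print(later_copies, all_copies)
```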

Since we are working with colors, let's visualize them using pandas' Styler methods. Please note that this styling is not saved with the dataframe and does not show in the GitHub preview of this notebook. The Styler output renders in VS Code; I didn't test other software.

In [22]:
#Trying to display each color using style
colors_df.sample(10).style.applymap(lambda x:"background-color: rgb%s"%x)

Unnamed: 0,Color1,Color2,Color3,Color4,Color5
1078,"(248, 177, 149)","(246, 114, 128)","(192, 108, 132)","(108, 91, 123)","(53, 92, 125)"
5236,"(121, 141, 141)","(111, 139, 141)","(102, 132, 134)","(96, 122, 125)","(65, 122, 127)"
8358,"(141, 64, 50)","(101, 53, 47)","(197, 126, 77)","(36, 34, 35)","(189, 94, 255)"
1546,"(255, 228, 225)","(0, 128, 128)","(104, 151, 187)","(255, 115, 115)","(12, 156, 124)"
1644,"(255, 203, 232)","(255, 244, 250)","(255, 230, 249)","(255, 234, 238)","(255, 242, 210)"
886,"(227, 210, 192)","(199, 172, 145)","(148, 112, 75)","(79, 72, 56)","(75, 57, 43)"
9305,"(66, 69, 75)","(85, 90, 80)","(199, 210, 174)","(45, 33, 33)","(69, 56, 56)"
6475,"(0, 0, 0)","(62, 0, 0)","(124, 0, 0)","(186, 0, 0)","(248, 0, 0)"
6314,"(255, 27, 182)","(45, 215, 209)","(220, 180, 248)","(107, 41, 127)","(0, 0, 0)"
5953,"(252, 253, 152)","(245, 253, 116)","(244, 229, 77)","(255, 193, 1)","(253, 255, 0)"
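The lambda above builds a CSS rule for every cell it receives. A slightly more defensive variant is sketched here: it styles only cells that look like an RGB tuple string and returns an empty style otherwise, which matters once the frame also holds word columns or NaN values. The helper name `rgb_background` and the pattern are my own for illustration.

```python
import re

# Matches strings such as "(206, 18, 18)".
RGB_PATTERN = re.compile(r"^\(\s*\d{1,3}\s*,\s*\d{1,3}\s*,\s*\d{1,3}\s*\)$")


def rgb_background(value):
    """Return a CSS background rule for RGB-tuple strings, else no style."""
    if isinstance(value, str) and RGB_PATTERN.match(value):
        return f"background-color: rgb{value}"
    return ""


print(rgb_background("(206, 18, 18)"))
print(repr(rgb_background("communism")))
print(repr(rgb_background(float("nan"))))
```

The function can be passed to the element-wise Styler method in place of the lambda (`Styler.applymap` in the pandas version used here; recent pandas releases call it `Styler.map`).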


#### - Merging the Words and the Color datasets

In [23]:
#merging
text2palette_df = words_df.join(colors_df)
text2palette_df.head()

Unnamed: 0,word1,word2,word3,word4,word5,word6,word7,word8,word9,word10,word11,Color1,Color2,Color3,Color4,Color5
0,communism,,,,,,,,,,,"(206, 18, 18)","(193, 27, 27)","(201, 149, 149)","(191, 138, 138)","(165, 27, 27)"
1,dark,neon,,,,,,,,,,"(116, 35, 35)","(117, 122, 37)","(35, 118, 41)","(44, 41, 114)","(115, 36, 116)"
2,dont,talk,to,me,,,,,,,,"(192, 237, 251)","(192, 221, 235)","(192, 205, 219)","(192, 189, 203)","(192, 173, 187)"
3,good,night,princess,,,,,,,,,"(49, 40, 66)","(65, 55, 84)","(87, 73, 105)","(120, 100, 131)","(156, 137, 161)"
4,good,morning,princess,,,,,,,,,"(246, 245, 230)","(244, 236, 217)","(244, 226, 204)","(240, 208, 186)","(234, 189, 169)"
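`join` pairs rows by index label, not by position, so the merge above is only correct because both frames still carry the same default `RangeIndex`. A minimal sketch of a sanity check, and of what happens when the labels do not line up, using hypothetical toy frames:

```python
import pandas as pd

toy_words = pd.DataFrame({"word1": ["communism", "dark", "dont"]})
toy_colors = pd.DataFrame(
    {"Color1": ["(206, 18, 18)", "(116, 35, 35)", "(192, 237, 251)"]}
)

# Both frames must share the same index for a row-by-row pairing.
assert len(toy_words) == len(toy_colors)
assert toy_words.index.equals(toy_colors.index)

toy_joined = toy_words.join(toy_colors)
print(toy_joined.shape)

# If the labels are shifted, join leaves NaN where no label matches.
shifted = toy_colors.copy()
shifted.index = [1, 2, 3]
misaligned = toy_words.join(shifted)
print(int(misaligned["Color1"].isna().sum()))
```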


In [24]:
text2palette_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10183 entries, 0 to 10182
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   word1   10165 non-null  object
 1   word2   6689 non-null   object
 2   word3   2181 non-null   object
 3   word4   676 non-null    object
 4   word5   218 non-null    object
 5   word6   80 non-null     object
 6   word7   31 non-null     object
 7   word8   13 non-null     object
 8   word9   7 non-null      object
 9   word10  3 non-null      object
 10  word11  2 non-null      object
 11  Color1  10183 non-null  object
 12  Color2  10183 non-null  object
 13  Color3  10183 non-null  object
 14  Color4  10183 non-null  object
 15  Color5  10183 non-null  object
dtypes: object(16)
memory usage: 1.2+ MB


In [25]:
text2palette_df.shape

(10183, 16)

In [26]:
#checking nulls
text2palette_df.isnull().sum()

word1        18
word2      3494
word3      8002
word4      9507
word5      9965
word6     10103
word7     10152
word8     10170
word9     10176
word10    10180
word11    10181
Color1        0
Color2        0
Color3        0
Color4        0
Color5        0
dtype: int64

In [22]:
#checking the rows that have null values in word1
text2palette_df[text2palette_df['word1'].isnull()]

Unnamed: 0,word1,word2,word3,word4,word5,word6,word7,word8,word9,word10,word11,Color1,Color2,Color3,Color4,Color5
214,,hours,crazy,,,,,,,,,"(102, 255, 0)","(0, 255, 252)","(255, 0, 244)","(255, 0, 0)","(255, 231, 0)"
542,,shades,of,grey,,,,,,,,"(216, 216, 216)","(167, 167, 167)","(103, 103, 103)","(49, 49, 49)","(0, 0, 0)"
1859,,shades,of,green,,,,,,,,"(0, 156, 26)","(34, 182, 0)","(38, 204, 0)","(123, 227, 130)","(210, 242, 212)"
5409,,shades,of,gray,,,,,,,,"(236, 236, 236)","(221, 221, 221)","(204, 204, 204)","(187, 187, 187)","(170, 170, 170)"
5670,,shades,of,brown,,,,,,,,"(236, 208, 124)","(178, 152, 92)","(141, 116, 78)","(116, 94, 52)","(96, 80, 40)"
6231,,shades,of,grey,,,,,,,,"(221, 219, 219)","(221, 221, 221)","(144, 142, 142)","(187, 187, 187)","(139, 136, 136)"
6257,,,,,,,,,,,,"(255, 255, 255)","(255, 255, 255)","(255, 255, 255)","(255, 255, 255)","(255, 255, 255)"
6737,,shades,of,orange,,,,,,,,"(240, 161, 80)","(240, 149, 55)","(244, 128, 32)","(240, 117, 15)","(199, 103, 6)"
6738,,shades,of,red,,,,,,,,"(249, 70, 9)","(251, 1, 1)","(227, 41, 20)","(199, 56, 19)","(165, 21, 13)"
6739,,shades,of,pink,,,,,,,,"(249, 187, 230)","(242, 158, 200)","(238, 115, 196)","(244, 83, 173)","(242, 11, 151)"


Some sentences start from word2 instead of word1. Judging from the sentences, it seems that the first word was a number and was filtered out when the dataset was gathered. I'll delete the fully null rows, rows 6257 and 7229. But before deleting rows and changing the indexing, let's check the duplicated colors from before: are they really related to black and white?

In [28]:
text2palette_df.iloc[dup_colors,:]

Unnamed: 0,word1,word2,word3,word4,word5,word6,word7,word8,word9,word10,word11,Color1,Color2,Color3,Color4,Color5
966,not,found,,,,,,,,,,"(238, 238, 238)","(221, 221, 221)","(204, 204, 204)","(187, 187, 187)","(170, 170, 170)"
1797,coal,,,,,,,,,,,"(0, 0, 0)","(0, 0, 0)","(0, 0, 0)","(0, 0, 0)","(0, 0, 0)"
1930,light,greys,,,,,,,,,,"(238, 238, 238)","(221, 221, 221)","(204, 204, 204)","(187, 187, 187)","(170, 170, 170)"
2627,darkness,,,,,,,,,,,"(0, 0, 0)","(0, 0, 0)","(0, 0, 0)","(0, 0, 0)","(0, 0, 0)"
4414,all,black,,,,,,,,,,"(0, 0, 0)","(0, 0, 0)","(0, 0, 0)","(0, 0, 0)","(0, 0, 0)"
6257,,,,,,,,,,,,"(255, 255, 255)","(255, 255, 255)","(255, 255, 255)","(255, 255, 255)","(255, 255, 255)"
7047,dark,shit,,,,,,,,,,"(0, 0, 0)","(0, 0, 0)","(0, 0, 0)","(0, 0, 0)","(0, 0, 0)"
7229,,,,,,,,,,,,"(255, 255, 255)","(255, 255, 255)","(255, 255, 255)","(255, 255, 255)","(255, 255, 255)"
7357,none,,,,,,,,,,,"(255, 255, 255)","(255, 255, 255)","(255, 255, 255)","(255, 255, 255)","(255, 255, 255)"
7844,empty,,,,,,,,,,,"(255, 255, 255)","(255, 255, 255)","(255, 255, 255)","(255, 255, 255)","(255, 255, 255)"


Yes, they are indeed related to words like black and light. We can proceed with deleting the fully NaN rows.

In [29]:
#checking the dataframe before deleting the row
text2palette_df.iloc[6255:6260,:]

Unnamed: 0,word1,word2,word3,word4,word5,word6,word7,word8,word9,word10,word11,Color1,Color2,Color3,Color4,Color5
6255,sundance,,,,,,,,,,,"(0, 178, 182)","(29, 195, 182)","(74, 223, 182)","(187, 234, 121)","(255, 248, 93)"
6256,come,a,little,closer,,,,,,,,"(69, 29, 11)","(214, 86, 0)","(255, 93, 126)","(251, 142, 175)","(231, 179, 255)"
6257,,,,,,,,,,,,"(255, 255, 255)","(255, 255, 255)","(255, 255, 255)","(255, 255, 255)","(255, 255, 255)"
6258,still,smoke,,,,,,,,,,"(209, 219, 221)","(206, 209, 214)","(197, 197, 197)","(175, 201, 199)","(186, 184, 175)"
6259,daydreams,,,,,,,,,,,"(167, 227, 192)","(123, 186, 166)","(85, 130, 180)","(127, 138, 193)","(169, 154, 234)"


In [30]:
text2palette_df.iloc[7225:7230,:]

Unnamed: 0,word1,word2,word3,word4,word5,word6,word7,word8,word9,word10,word11,Color1,Color2,Color3,Color4,Color5
7225,used,,,,,,,,,,,"(207, 160, 16)","(232, 179, 12)","(188, 146, 14)","(195, 151, 18)","(245, 208, 10)"
7226,golds,yellow,,,,,,,,,,"(255, 238, 0)","(255, 225, 53)","(255, 227, 45)","(255, 223, 0)","(255, 215, 0)"
7227,ice,like,,,,,,,,,,"(234, 225, 225)","(176, 210, 219)","(134, 134, 242)","(77, 63, 163)","(34, 41, 128)"
7228,vampires,,,,,,,,,,,"(99, 68, 68)","(154, 49, 12)","(221, 38, 38)","(176, 79, 79)","(178, 153, 153)"
7229,,,,,,,,,,,,"(255, 255, 255)","(255, 255, 255)","(255, 255, 255)","(255, 255, 255)","(255, 255, 255)"


In [31]:
#deleting the row and checking the dataframe
text2palette_df = text2palette_df.drop([6257,7229])
text2palette_df.iloc[6255:6260,:]

Unnamed: 0,word1,word2,word3,word4,word5,word6,word7,word8,word9,word10,word11,Color1,Color2,Color3,Color4,Color5
6255,sundance,,,,,,,,,,,"(0, 178, 182)","(29, 195, 182)","(74, 223, 182)","(187, 234, 121)","(255, 248, 93)"
6256,come,a,little,closer,,,,,,,,"(69, 29, 11)","(214, 86, 0)","(255, 93, 126)","(251, 142, 175)","(231, 179, 255)"
6258,still,smoke,,,,,,,,,,"(209, 219, 221)","(206, 209, 214)","(197, 197, 197)","(175, 201, 199)","(186, 184, 175)"
6259,daydreams,,,,,,,,,,,"(167, 227, 192)","(123, 186, 166)","(85, 130, 180)","(127, 138, 193)","(169, 154, 234)"
6260,natalie,,,,,,,,,,,"(222, 240, 239)","(174, 236, 236)","(167, 210, 229)","(221, 206, 238)","(179, 242, 204)"


In [32]:
text2palette_df.iloc[7225:7230,:]

Unnamed: 0,word1,word2,word3,word4,word5,word6,word7,word8,word9,word10,word11,Color1,Color2,Color3,Color4,Color5
7226,golds,yellow,,,,,,,,,,"(255, 238, 0)","(255, 225, 53)","(255, 227, 45)","(255, 223, 0)","(255, 215, 0)"
7227,ice,like,,,,,,,,,,"(234, 225, 225)","(176, 210, 219)","(134, 134, 242)","(77, 63, 163)","(34, 41, 128)"
7228,vampires,,,,,,,,,,,"(99, 68, 68)","(154, 49, 12)","(221, 38, 38)","(176, 79, 79)","(178, 153, 153)"
7230,icecream,parlour,,,,,,,,,,"(29, 229, 211)","(29, 229, 211)","(246, 241, 99)","(248, 140, 187)","(248, 140, 187)"
7231,mustard,sea,,,,,,,,,,"(105, 231, 159)","(240, 196, 25)","(140, 144, 141)","(240, 196, 25)","(105, 231, 159)"


In [33]:
# fixing the indices
text2palette_df = text2palette_df.reset_index(drop=True) 
text2palette_df.iloc[6255:6260,:]

Unnamed: 0,word1,word2,word3,word4,word5,word6,word7,word8,word9,word10,word11,Color1,Color2,Color3,Color4,Color5
6255,sundance,,,,,,,,,,,"(0, 178, 182)","(29, 195, 182)","(74, 223, 182)","(187, 234, 121)","(255, 248, 93)"
6256,come,a,little,closer,,,,,,,,"(69, 29, 11)","(214, 86, 0)","(255, 93, 126)","(251, 142, 175)","(231, 179, 255)"
6257,still,smoke,,,,,,,,,,"(209, 219, 221)","(206, 209, 214)","(197, 197, 197)","(175, 201, 199)","(186, 184, 175)"
6258,daydreams,,,,,,,,,,,"(167, 227, 192)","(123, 186, 166)","(85, 130, 180)","(127, 138, 193)","(169, 154, 234)"
6259,natalie,,,,,,,,,,,"(222, 240, 239)","(174, 236, 236)","(167, 210, 229)","(221, 206, 238)","(179, 242, 204)"


In [34]:
text2palette_df.iloc[7225:7230,:]

Unnamed: 0,word1,word2,word3,word4,word5,word6,word7,word8,word9,word10,word11,Color1,Color2,Color3,Color4,Color5
7225,golds,yellow,,,,,,,,,,"(255, 238, 0)","(255, 225, 53)","(255, 227, 45)","(255, 223, 0)","(255, 215, 0)"
7226,ice,like,,,,,,,,,,"(234, 225, 225)","(176, 210, 219)","(134, 134, 242)","(77, 63, 163)","(34, 41, 128)"
7227,vampires,,,,,,,,,,,"(99, 68, 68)","(154, 49, 12)","(221, 38, 38)","(176, 79, 79)","(178, 153, 153)"
7228,icecream,parlour,,,,,,,,,,"(29, 229, 211)","(29, 229, 211)","(246, 241, 99)","(248, 140, 187)","(248, 140, 187)"
7229,mustard,sea,,,,,,,,,,"(105, 231, 159)","(240, 196, 25)","(140, 144, 141)","(240, 196, 25)","(105, 231, 159)"
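The same removal can be expressed without hard-coding row labels. This is a sketch under the assumption that "fully null" means every `word*` column is NaN. The frame `toy` is a hypothetical miniature of the merged table.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "word1": ["sundance", np.nan, np.nan],
        "word2": [np.nan, np.nan, "shades"],
        "Color1": ["(0, 178, 182)", "(255, 255, 255)", "(216, 216, 216)"],
    }
)

word_cols = [c for c in toy.columns if c.startswith("word")]

# Drop rows where every word column is NaN, then renumber the index.
cleaned = toy.dropna(how="all", subset=word_cols).reset_index(drop=True)
print(cleaned)
```

Rows that only miss their first word (like the "shades ..." entries) are kept, matching the manual approach above.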


The fully null rows are now deleted. Let's look for any duplicates.

In [35]:
#checking on duplicates
text2palette_df.duplicated().sum()

1

In [36]:
#looking at the duplicates
text2palette_df[text2palette_df.duplicated()]

Unnamed: 0,word1,word2,word3,word4,word5,word6,word7,word8,word9,word10,word11,Color1,Color2,Color3,Color4,Color5
7865,zebra,,,,,,,,,,,"(0, 0, 0)","(255, 255, 255)","(0, 0, 0)","(255, 255, 255)","(0, 0, 0)"


In [37]:
#checking on all zebra entries
text2palette_df[text2palette_df['word1']=='zebra']

Unnamed: 0,word1,word2,word3,word4,word5,word6,word7,word8,word9,word10,word11,Color1,Color2,Color3,Color4,Color5
1347,zebra,,,,,,,,,,,"(11, 10, 10)","(255, 255, 255)","(21, 18, 18)","(255, 255, 255)","(0, 0, 0)"
5185,zebra,,,,,,,,,,,"(0, 0, 0)","(255, 255, 255)","(0, 0, 0)","(255, 255, 255)","(0, 0, 0)"
7865,zebra,,,,,,,,,,,"(0, 0, 0)","(255, 255, 255)","(0, 0, 0)","(255, 255, 255)","(0, 0, 0)"


In [38]:
#deleting the duplicated zebra
text2palette_df = text2palette_df.drop([7865])
text2palette_df = text2palette_df.reset_index(drop=True) 
text2palette_df.iloc[7864:7868,:]

Unnamed: 0,word1,word2,word3,word4,word5,word6,word7,word8,word9,word10,word11,Color1,Color2,Color3,Color4,Color5
7864,suit,and,tie,,,,,,,,,"(0, 2, 13)","(225, 234, 225)","(2, 2, 16)","(241, 229, 229)","(22, 3, 3)"
7865,skin,,,,,,,,,,,"(255, 223, 196)","(225, 184, 153)","(229, 184, 135)","(240, 213, 190)","(255, 220, 178)"
7866,do,not,,,,,,,,,,"(0, 0, 0)","(0, 161, 0)","(185, 168, 31)","(132, 123, 123)","(234, 207, 207)"
7867,hospice,,,,,,,,,,,"(176, 153, 142)","(207, 190, 133)","(239, 233, 211)","(81, 40, 38)","(0, 0, 0)"


In [39]:
text2palette_df.duplicated().sum()

0
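For reference, dropping exact duplicates and renumbering can also be done in one step with `drop_duplicates`, which keeps the first copy by default. A small sketch on a hypothetical frame modelled on the zebra rows:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "word1": ["zebra", "zebra", "zebra"],
        "Color1": ["(11, 10, 10)", "(0, 0, 0)", "(0, 0, 0)"],
    }
)

# Keep the first copy of each fully identical row and renumber the index.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(len(toy), len(deduped))
```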

Now that we have handled the duplicates and the fully null rows, let's take another look at the color palettes alongside the words they represent, using the Styler again.

In [40]:
text2palette_df.sample(10).style.applymap(lambda x:"background-color: rgb%s"%x)

Unnamed: 0,word1,word2,word3,word4,word5,word6,word7,word8,word9,word10,word11,Color1,Color2,Color3,Color4,Color5
5184,blue,jeans,,,,,,,,,,"(179, 199, 214)","(147, 168, 186)","(104, 117, 165)","(75, 87, 146)","(63, 70, 124)"
1676,cool,autumn,breeze,,,,,,,,,"(227, 125, 125)","(227, 156, 125)","(255, 198, 110)","(221, 183, 110)","(251, 247, 153)"
9820,,greys,and,,reds,,,,,,,"(116, 111, 111)","(163, 161, 161)","(204, 204, 204)","(244, 19, 19)","(255, 91, 91)"
804,young,and,having,fun,,,,,,,,"(201, 216, 235)","(255, 255, 255)","(237, 224, 190)","(194, 151, 129)","(112, 12, 65)"
837,not,alike,,,,,,,,,,"(255, 201, 201)","(255, 124, 126)","(204, 99, 101)","(127, 62, 63)","(94, 45, 47)"
8033,new,beach,,,,,,,,,,"(176, 226, 255)","(54, 184, 234)","(32, 178, 170)","(139, 87, 66)","(205, 129, 98)"
9794,lacking,in,saturation,,,,,,,,,"(163, 124, 124)","(205, 171, 131)","(192, 188, 124)","(152, 172, 154)","(113, 114, 131)"
9046,minty,flavors,,,,,,,,,,"(201, 239, 228)","(180, 215, 205)","(160, 191, 182)","(140, 167, 159)","(120, 143, 136)"
10163,purple,unicorn,,,,,,,,,,"(107, 64, 216)","(125, 74, 227)","(150, 74, 249)","(171, 103, 210)","(218, 77, 255)"
3226,hot,head,,,,,,,,,,"(255, 53, 53)","(255, 46, 46)","(255, 37, 37)","(255, 28, 28)","(255, 0, 0)"


The palettes are really fun to investigate and explore. I'll now save this dataframe to a CSV file.

In [41]:
#save final dataframe, Text2Palette (T2P)
text2palette_df.to_csv('T2P.csv', index=False)

### 3- Keras_Colors Dataset
This dataset was obtained from a neural network project that takes a word and generates a color; this is exactly the neural network that I'll be using as a reference for this project. It is not clear how this dataset was obtained, but it consists of words with one color per word. This is the format that I'll use to train the neural network, so after investigating this dataset I'll modify the T2P dataframe and merge it with this one.

In [42]:
#loading the dataset
k_colors_df = pd.read_csv('keras_colors.csv')
k_colors_df.head()

Unnamed: 0,name,red,green,blue
0,parakeet,174,182,87
1,saddle brown,88,52,1
2,cucumber crush,222,237,215
3,pool blue,134,194,201
4,distance,98,110,130


In [43]:
k_colors_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14157 entries, 0 to 14156
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    14157 non-null  object
 1   red     14157 non-null  int64 
 2   green   14157 non-null  int64 
 3   blue    14157 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 442.5+ KB


In [44]:
k_colors_df.shape

(14157, 4)

In [46]:
#checking on nulls
k_colors_df.isnull().sum()

name     0
red      0
green    0
blue     0
dtype: int64

In [47]:
#checking on duplicates
k_colors_df.duplicated().sum()

0

In [49]:
#rename the column 'name' to 'word' since this is what I've been using to refer to the phrases before
k_colors_df.rename(columns={'name': 'word'}, inplace=True)
k_colors_df.head()

Unnamed: 0,word,red,green,blue
0,parakeet,174,182,87
1,saddle brown,88,52,1
2,cucumber crush,222,237,215
3,pool blue,134,194,201
4,distance,98,110,130


There are no nulls or duplicates; the data is clean and ready to work with. I'll now modify the T2P dataset to be in the same format as the keras_colors dataset.

### 4- Modifying and Merging the T2P Dataset
Although the T2P dataset is very clear and promising with its full palettes, I couldn't manage to design a neural network that makes good use of it. This is a project with a deadline, so I decided to follow the approach of the Keras-Colors repository and generate only 1 color with the neural network instead of a full 5-color palette. In this section, I'll take the entries that contain only 1 word and add them to the keras_colors dataset.

In [52]:
t2p_df = text2palette_df
t2p_df.head()

Unnamed: 0,word1,word2,word3,word4,word5,word6,word7,word8,word9,word10,word11,Color1,Color2,Color3,Color4,Color5
0,communism,,,,,,,,,,,"(206, 18, 18)","(193, 27, 27)","(201, 149, 149)","(191, 138, 138)","(165, 27, 27)"
1,dark,neon,,,,,,,,,,"(116, 35, 35)","(117, 122, 37)","(35, 118, 41)","(44, 41, 114)","(115, 36, 116)"
2,dont,talk,to,me,,,,,,,,"(192, 237, 251)","(192, 221, 235)","(192, 205, 219)","(192, 189, 203)","(192, 173, 187)"
3,good,night,princess,,,,,,,,,"(49, 40, 66)","(65, 55, 84)","(87, 73, 105)","(120, 100, 131)","(156, 137, 161)"
4,good,morning,princess,,,,,,,,,"(246, 245, 230)","(244, 236, 217)","(244, 226, 204)","(240, 208, 186)","(234, 189, 169)"


In [53]:
#use rows that have only 1 word, in other words entries that have NaN in the 'word2' column and beyond
t2p_df = t2p_df[t2p_df['word2'].isnull()]
t2p_df.head()

Unnamed: 0,word1,word2,word3,word4,word5,word6,word7,word8,word9,word10,word11,Color1,Color2,Color3,Color4,Color5
0,communism,,,,,,,,,,,"(206, 18, 18)","(193, 27, 27)","(201, 149, 149)","(191, 138, 138)","(165, 27, 27)"
9,wedding,,,,,,,,,,,"(218, 226, 227)","(158, 181, 184)","(238, 222, 213)","(227, 183, 159)","(206, 154, 126)"
11,bakery,,,,,,,,,,,"(240, 215, 167)","(195, 121, 96)","(137, 78, 63)","(238, 225, 186)","(156, 99, 79)"
15,tenderness,,,,,,,,,,,"(255, 231, 185)","(255, 235, 197)","(255, 239, 208)","(255, 243, 220)","(255, 247, 231)"
19,exuberance,,,,,,,,,,,"(249, 58, 58)","(241, 253, 59)","(73, 205, 244)","(174, 135, 135)","(154, 142, 103)"
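The `word2` filter assumes that words always fill the columns from left to right. Some rows in this dataset have gaps (for example a missing `word1`), so a stricter check may be worth considering. The sketch below compares the two filters on a hypothetical toy frame. In this notebook the final dataset ends up with no null words, so the simpler filter appears to be sufficient here.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "word1": ["communism", "dark", np.nan, "wedding"],
        "word2": [np.nan, "neon", np.nan, np.nan],
        "word3": [np.nan, np.nan, "reds", np.nan],
    }
)

word_cols = ["word1", "word2", "word3"]

# Filter used in the notebook: second word missing.
by_word2 = toy[toy["word2"].isnull()]

# Stricter variant: exactly one word present, and it sits in word1.
one_word = toy[toy["word1"].notna() & (toy[word_cols].notna().sum(axis=1) == 1)]

print(by_word2.index.tolist(), one_word.index.tolist())
```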


In [54]:
#drop all other columns
t2p_df = t2p_df.drop(['word2','word3','word4','word5','word6','word7','word8','word9','word10','word11'], axis=1)
t2p_df.head()

Unnamed: 0,word1,Color1,Color2,Color3,Color4,Color5
0,communism,"(206, 18, 18)","(193, 27, 27)","(201, 149, 149)","(191, 138, 138)","(165, 27, 27)"
9,wedding,"(218, 226, 227)","(158, 181, 184)","(238, 222, 213)","(227, 183, 159)","(206, 154, 126)"
11,bakery,"(240, 215, 167)","(195, 121, 96)","(137, 78, 63)","(238, 225, 186)","(156, 99, 79)"
15,tenderness,"(255, 231, 185)","(255, 235, 197)","(255, 239, 208)","(255, 243, 220)","(255, 247, 231)"
19,exuberance,"(249, 58, 58)","(241, 253, 59)","(73, 205, 244)","(174, 135, 135)","(154, 142, 103)"


In [55]:
#reset the indexing
t2p_df = t2p_df.reset_index(drop=True)
t2p_df.head()

Unnamed: 0,word1,Color1,Color2,Color3,Color4,Color5
0,communism,"(206, 18, 18)","(193, 27, 27)","(201, 149, 149)","(191, 138, 138)","(165, 27, 27)"
1,wedding,"(218, 226, 227)","(158, 181, 184)","(238, 222, 213)","(227, 183, 159)","(206, 154, 126)"
2,bakery,"(240, 215, 167)","(195, 121, 96)","(137, 78, 63)","(238, 225, 186)","(156, 99, 79)"
3,tenderness,"(255, 231, 185)","(255, 235, 197)","(255, 239, 208)","(255, 243, 220)","(255, 247, 231)"
4,exuberance,"(249, 58, 58)","(241, 253, 59)","(73, 205, 244)","(174, 135, 135)","(154, 142, 103)"


In [56]:
t2p_df.shape

(3491, 6)

I was not sure whether to take only one of the five colors, or to keep all of them and repeat each entry five times, once per color. Keeping all five should work well for palettes that contain different shades of the same color. For palettes that represent words like 'beach' or 'summer', however, it adds noise: the network will probably average the five colors and produce a color that may feel off. I had planned to train the network both ways, once with all five colors and once with a single color, but training takes time and the deadline is near. I decided to keep all five colors, simply because the idea of a larger dataset is attractive.

In [57]:
#using melt method 
t2p_df = t2p_df.melt(id_vars=["word1"],  
        value_name="Color")
t2p_df.head()

Unnamed: 0,word1,variable,Color
0,communism,Color1,"(206, 18, 18)"
1,wedding,Color1,"(218, 226, 227)"
2,bakery,Color1,"(240, 215, 167)"
3,tenderness,Color1,"(255, 231, 185)"
4,exuberance,Color1,"(249, 58, 58)"


In [58]:
#dropping the variable column since we don't need to know which color was which in the palette
t2p_df = t2p_df.drop('variable', axis=1)
t2p_df.head()

Unnamed: 0,word1,Color
0,communism,"(206, 18, 18)"
1,wedding,"(218, 226, 227)"
2,bakery,"(240, 215, 167)"
3,tenderness,"(255, 231, 185)"
4,exuberance,"(249, 58, 58)"


In [59]:
# 3491 * 5 = 17455
t2p_df.shape

(17455, 2)
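The two options weighed above can be compared on a toy frame. This is only a sketch with made-up rows and two color columns instead of five: option A mirrors the `melt` approach used in this notebook, option B keeps just the first color of each palette.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "word1": ["communism", "wedding"],
        "Color1": ["(206, 18, 18)", "(218, 226, 227)"],
        "Color2": ["(193, 27, 27)", "(158, 181, 184)"],
    }
)

# Option A: one row per (word, color) pair, as done above with melt.
long_form = toy.melt(id_vars=["word1"], value_name="Color").drop(columns="variable")

# Option B: keep only the first color of each palette.
first_only = toy[["word1", "Color1"]].rename(columns={"Color1": "Color"})

print(long_form.shape, first_only.shape)
```

Option A multiplies the row count by the number of color columns, which is where the 3491 * 5 = 17455 rows come from.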

I'll now change the format of the colors to match the keras_colors dataset, which has red, green, and blue in separate columns. The colors in the T2P dataset are stored as strings, so I'll convert them to int as well.

In [61]:
# delete the brackets; regex=False treats "(" and ")" as literal characters
t2p_df['Color'] = t2p_df['Color'].str.replace("(", "", regex=False)
t2p_df['Color'] = t2p_df['Color'].str.replace(")", "", regex=False)
t2p_df.head()


Unnamed: 0,word1,Color
0,communism,"206, 18, 18"
1,wedding,"218, 226, 227"
2,bakery,"240, 215, 167"
3,tenderness,"255, 231, 185"
4,exuberance,"249, 58, 58"


In [62]:
# adding the new columns 
t2p_df[['red', 'green', 'blue']] = t2p_df['Color'].str.split(', ', expand=True)
t2p_df.head()

Unnamed: 0,word1,Color,red,green,blue
0,communism,"206, 18, 18",206,18,18
1,wedding,"218, 226, 227",218,226,227
2,bakery,"240, 215, 167",240,215,167
3,tenderness,"255, 231, 185",255,231,185
4,exuberance,"249, 58, 58",249,58,58


In [63]:
#changing the type to int and dropping the color column
t2p_df[['red', 'green', 'blue']] = t2p_df[['red', 'green', 'blue']].astype(int)
t2p_df = t2p_df.drop('Color', axis=1)
t2p_df.head()

Unnamed: 0,word1,red,green,blue
0,communism,206,18,18
1,wedding,218,226,227
2,bakery,240,215,167
3,tenderness,255,231,185
4,exuberance,249,58,58
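The three steps above (strip brackets, split, cast) can also be collapsed into a single regular-expression pass. This is an alternative sketch, not the method used in this notebook. It relies on `Series.str.extract`, where named groups become column names, and uses a hypothetical toy frame.

```python
import pandas as pd

toy = pd.DataFrame(
    {"word1": ["communism", "wedding"], "Color": ["(206, 18, 18)", "(218, 226, 227)"]}
)

# Capture the three integers in one pass; named groups become column names.
pattern = r"\((?P<red>\d+),\s*(?P<green>\d+),\s*(?P<blue>\d+)\)"
rgb = toy["Color"].str.extract(pattern).astype(int)

# Put the word next to its numeric color channels.
result = pd.concat([toy[["word1"]], rgb], axis=1)
print(result)
```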


In [66]:
#changing word1 to word 
t2p_df.rename(columns={'word1': 'word'}, inplace=True)
t2p_df.head()

Unnamed: 0,word,red,green,blue
0,communism,206,18,18
1,wedding,218,226,227
2,bakery,240,215,167
3,tenderness,255,231,185
4,exuberance,249,58,58


Now we can finally merge the two dataframes

In [67]:
final_df = pd.concat([k_colors_df, t2p_df])
final_df.head()


Unnamed: 0,word,red,green,blue
0,parakeet,174,182,87
1,saddle brown,88,52,1
2,cucumber crush,222,237,215
3,pool blue,134,194,201
4,distance,98,110,130


In [68]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31612 entries, 0 to 17454
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   word    31612 non-null  object
 1   red     31612 non-null  int64 
 2   green   31612 non-null  int64 
 3   blue    31612 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 1.2+ MB


In [69]:
final_df.shape

(31612, 4)
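The `info()` output above reports 31612 entries with labels running only "0 to 17454", which means the row labels of the two frames repeat after `concat`. That is harmless here because the CSV is saved with `index=False`, but the sketch below shows how `ignore_index=True` would renumber the rows. The two single-row frames are hypothetical stand-ins.

```python
import pandas as pd

toy_keras = pd.DataFrame({"word": ["parakeet"], "red": [174], "green": [182], "blue": [87]})
toy_t2p = pd.DataFrame({"word": ["communism"], "red": [206], "green": [18], "blue": [18]})

# Without ignore_index the row labels of both frames are kept, so they repeat.
stacked = pd.concat([toy_keras, toy_t2p])
print(stacked.index.tolist())

# ignore_index=True renumbers the rows 0..n-1.
renumbered = pd.concat([toy_keras, toy_t2p], ignore_index=True)
print(renumbered.index.tolist())
```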

In [70]:
#save it into a csv file
final_df.to_csv('final_dataset.csv', index=False)

### 5- Final Thoughts
Now we are ready to work with the neural network. This EDA can be improved a lot, hopefully in the future. Before deciding on the final format of the dataset, there were many trials with many neural networks, some with a 3% accuracy score. In the model notebook you can see the finalized neural network built with Keras; you can train the network all over again, or just load it and use the predict method to generate a single color. To generate palettes, please run the GUI file.

### 6- Resources
- Text2Colors repository:
https://github.com/awesome-davian/Text2Colors

- Keras-colors repository:
https://github.com/Tony607/Keras-Colors