In [65]:
import pandas as pd

#Data Exploration

We'll start with general information about the dataset.

In [66]:
#Name of csv is fer2013.csv
df = pd.read_csv('fer2013.csv')

###Number of entries/observations

In [67]:
entryCount = len(df)
print("The number of entries is:", entryCount)
print("This means we have",entryCount, "images.")

The number of entries is: 35887
This means we have 35887 images.


###An entry is composed of 3 things: emotion, pixels, and Usage.



In [68]:
#An entry is comprised of an emotion, the pixels, and the usage (training vs testing)
print(df.columns)

Index(['emotion', 'pixels', 'Usage'], dtype='object')


**Emotion**: An integer label from 0 to 6, each standing for a different emotion. 0 is angry, 1 is disgust, 2 is fear, 3 is happy, 4 is sadness, 5 is surprise, and 6 is neutral.
<br>

**Pixels**: A 48x48 image stored as 2304 space-separated numbers, one per pixel. Values range from 0 (black) to 255 (white) because the images are greyscale. We will most likely normalize these pixel values.
<br>

**Usage**: One of Training, PublicTest, or PrivateTest. We may remove this column to guarantee the model is as authentic as it can be (it should perform well regardless of what data it gets).

In [69]:
#Confirming that the above is true. Note 48*48 = 2304
print("Encodings are:",df['emotion'].unique(),"\n")
print(df["pixels"].apply(lambda n:(len(n.split()))))
print("\nDifferent usages are:",df['Usage'].unique())

Encodings are: [0 2 4 6 3 5 1] 

0        2304
1        2304
2        2304
3        2304
4        2304
         ... 
35882    2304
35883    2304
35884    2304
35885    2304
35886    2304
Name: pixels, Length: 35887, dtype: int64

Different usages are: ['Training' 'PublicTest' 'PrivateTest']
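
The checks above confirm the label range and the 2304 values per row. As a minimal sketch of how one row could be decoded for inspection, the snippet below maps a label to its name and reshapes a pixel string into a 48x48 array. The names `EMOTIONS`, `IMAGE_SIDE`, and `pixels_to_image` are our own illustrative choices, and the pixel string is a synthetic gradient rather than a real row from the dataset.

```python
import numpy as np

# Label names as listed in the description of the emotion column.
EMOTIONS = {0: "angry", 1: "disgust", 2: "fear", 3: "happy",
            4: "sadness", 5: "surprise", 6: "neutral"}

IMAGE_SIDE = 48  # images are 48x48, i.e. 2304 values per row


def pixels_to_image(pixel_string):
    """Turn one space-separated 'pixels' string into a 48x48 uint8 array."""
    values = np.array([int(v) for v in pixel_string.split()], dtype=np.uint8)
    expected = IMAGE_SIDE * IMAGE_SIDE
    if values.size != expected:
        raise ValueError(f"expected {expected} pixels, got {values.size}")
    return values.reshape(IMAGE_SIDE, IMAGE_SIDE)


# Hypothetical stand-in for one row: a repeating 0..255 gradient,
# not real FER2013 data.
fake_pixels = " ".join(str(i % 256) for i in range(IMAGE_SIDE * IMAGE_SIDE))
image = pixels_to_image(fake_pixels)
print(EMOTIONS[3], image.shape, image.min(), image.max())
```

On the real DataFrame, something like `pixels_to_image(df["pixels"].iloc[0])` should give the first image as an array that a plotting library can display.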


###How much of each image class (emotion) do we have?

In [70]:
df.emotion.value_counts()

3    8989
6    6198
4    6077
2    5121
0    4953
5    4002
1     547
Name: emotion, dtype: int64

We will probably subsample the dataset so that each emotion has only 547 images (bounded by the smallest class, disgust). This is because we don't want the model to become biased from seeing one emotion significantly more often than another. It will also reduce computation time.
<br>
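
To put the imbalance in numbers, here is a small worked calculation using only the counts printed by `value_counts()` above. It is a sketch for illustration; the variable names are our own.

```python
# Class counts copied from the value_counts() output above.
counts = {3: 8989, 6: 6198, 4: 6077, 2: 5121, 0: 4953, 5: 4002, 1: 547}

total = sum(counts.values())
smallest = min(counts.values())
balanced_total = smallest * len(counts)  # images kept if every class is capped

for emotion, n in sorted(counts.items()):
    print(f"emotion {emotion}: {n:5d} images ({n / total:.1%} of the data)")

print(f"largest / smallest class: {max(counts.values()) / smallest:.1f}x")
print(f"balanced subset: {balanced_total} of {total} images "
      f"({balanced_total / total:.1%} kept)")
```

Capping every class at 547 keeps 3829 of the 35887 images, so the balance comes at the cost of discarding most of the data. That trade-off is worth keeping in mind when judging model accuracy later.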


***We have already written a Python script that does this balancing; it will be available on GitHub:***
<br>
<details>
  <summary>Click for script</summary>
  

```
LIMIT = 547

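def isLimit(counts):
    # True once every class has collected at least LIMIT images.
    # (The parameter is no longer called `map`, which shadows the built-in.)
    return all(count >= LIMIT for count in counts.values())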

def saveData(data):
    with open('face-emo.csv', 'w') as file:
        file.writelines(data)
    print("data saved under face-emo.csv")
    return


def fetch_data():
    """
    { 0: "Angry", 1: "Disgust", 2: "Fear", 3: "Happy", 4: "Sad", 5: "Surprise", 6: "Neutral" }

    This function is used to balance the data between emotions so we have an
    equal number of images for each category. The new data is saved in a new
    file called face-emo.csv.

    TODO: We might need to remove the Usage Column
    """
    data = []
    classes = {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0}

    with open('fer2013.csv', 'r') as file:
        data.append(file.readline())

        for line in file:
            emo = int(line.split(',')[0])
            
            if classes[emo] != LIMIT:
                data.append(line)
                classes[emo] += 1
        
            if isLimit(classes):
                break

        print(classes)

        # print(f"length of data should equal 547 * 7. data = {len(data) - 1} == {547 * 7}")
        # Saving new data
        saveData(data)

    return


if __name__ == "__main__":
    fetch_data()
```

  
</details>
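
As an alternative to the line-by-line script, the same balancing could be sketched directly in pandas. Note the difference: the script keeps the first 547 rows of each class in file order, while this version draws a random sample per class. The helper name `balance_classes` and its parameters are hypothetical, the toy frame is synthetic, and `DataFrameGroupBy.sample` assumes pandas 1.1 or newer.

```python
import pandas as pd


def balance_classes(frame, label_col="emotion", per_class=None, seed=0):
    """Return a frame with the same number of rows for every label.

    If `per_class` is None, the size of the smallest class is used.
    """
    if per_class is None:
        per_class = int(frame[label_col].value_counts().min())
    # Draw `per_class` rows from each label group (pandas >= 1.1).
    sampled = frame.groupby(label_col).sample(n=per_class, random_state=seed)
    return sampled.reset_index(drop=True)


# Tiny synthetic frame standing in for fer2013.csv (not real data).
toy = pd.DataFrame({
    "emotion": [0, 0, 0, 0, 1, 1, 3, 3, 3],
    "pixels": ["0 0"] * 9,
    "Usage": ["Training"] * 9,
})
balanced = balance_classes(toy)
print(balanced["emotion"].value_counts().sort_index())
```

Applied to the full dataset, `balance_classes(df)` would cap every emotion at the size of the smallest class, which is 547 here.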

###Handling Null Data
The dataset shouldn't contain any null data to begin with, but we'll confirm that for good measure.

In [71]:
df.isna()

| | emotion | pixels | Usage |
|---|---|---|---|
| 0 | False | False | False |
| 1 | False | False | False |
| 2 | False | False | False |
| 3 | False | False | False |
| 4 | False | False | False |
| ... | ... | ... | ... |
| 35882 | False | False | False |
| 35883 | False | False | False |
| 35884 | False | False | False |
| 35885 | False | False | False |
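
The boolean table only shows a handful of rows, so it cannot rule out a null value somewhere in the middle. A more compact check is to count the missing values per column. The snippet below is a sketch on a synthetic frame with one deliberately missing value; the variable names are our own.

```python
import pandas as pd

# Toy frame with one deliberately missing value (synthetic, not FER2013 data).
toy = pd.DataFrame({
    "emotion": [0, 3, 6],
    "pixels": ["0 0 0", None, "255 255 255"],
    "Usage": ["Training", "Training", "PublicTest"],
})

nulls_per_column = toy.isna().sum()        # missing values in each column
has_nulls = bool(toy.isna().any().any())   # True if anything at all is missing

print(nulls_per_column)
print("Any nulls?", has_nulls)
```

On the real data, `df.isna().sum()` should print zero for all three columns if the dataset is as clean as we expect.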


In [72]:
df.dropna()

| | emotion | pixels | Usage |
|---|---|---|---|
| 0 | 0 | 70 80 82 72 58 58 60 63 54 58 60 48 89 115 121... | Training |
| 1 | 0 | 151 150 147 155 148 133 111 140 170 174 182 15... | Training |
| 2 | 2 | 231 212 156 164 174 138 161 173 182 200 106 38... | Training |
| 3 | 4 | 24 32 36 30 32 23 19 20 30 41 21 22 32 34 21 1... | Training |
| 4 | 6 | 4 0 0 0 0 0 0 0 0 0 0 0 3 15 23 28 48 50 58 84... | Training |
| ... | ... | ... | ... |
| 35882 | 6 | 50 36 17 22 23 29 33 39 34 37 37 37 39 43 48 5... | PrivateTest |
| 35883 | 3 | 178 174 172 173 181 188 191 194 196 199 200 20... | PrivateTest |
| 35884 | 0 | 17 17 16 23 28 22 19 17 25 26 20 24 31 19 27 9... | PrivateTest |
| 35885 | 3 | 30 28 28 29 31 30 42 68 79 81 77 67 67 71 63 6... | PrivateTest |


Dropping null data won't ruin the model. We'll just have fewer images to learn from, and since the model learns from individual images rather than from statistical trends across the dataset, removing a few rows does not distort it. Dropping where we can is good because it will help with computation speed.
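
One detail to keep in mind: `DataFrame.dropna()` returns a new DataFrame and leaves the original untouched unless the result is assigned (or `inplace=True` is passed). A minimal sketch on a synthetic frame, with illustrative variable names:

```python
import pandas as pd

# Synthetic frame with one missing pixel string (not real data).
toy = pd.DataFrame({"emotion": [0, 3, 6], "pixels": ["0 0", None, "255 255"]})

toy.dropna()               # result is discarded; `toy` still has 3 rows
rows_before = len(toy)

cleaned = toy.dropna().reset_index(drop=True)  # keep the result explicitly
rows_after = len(cleaned)

print(rows_before, rows_after)
```

If null rows ever do appear in our data, writing `df = df.dropna()` would be one way to make the removal stick.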

###Plotting our example classes

###Preprocessing Procedure

As of now, we plan to normalize the greyscale values in each entry's pixels column. This should make the pixel values easier to work with, and a smaller value range across many images may help with overall speed. We also plan to keep an equal number of images for each emotion, 547, so that our model doesn't become biased toward an emotion it sees too often and so that the dataset is smaller.
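
The normalization step could look roughly like the sketch below, which scales every pixel from the 0-255 range to [0, 1] and stacks the images into one array. Dividing by 255 is one common choice and is an assumption on our part, as are the helper name `to_normalized_array` and the synthetic two-row frame.

```python
import numpy as np
import pandas as pd


def to_normalized_array(frame, pixel_col="pixels", side=48):
    """Stack pixel strings into a float32 array of shape (n, side, side) in [0, 1]."""
    rows = [
        np.array([int(v) for v in s.split()], dtype=np.float32)
        for s in frame[pixel_col]
    ]
    images = np.stack(rows).reshape(-1, side, side)
    return images / 255.0  # 0 stays 0.0 (black), 255 becomes 1.0 (white)


# Synthetic two-row frame: one all-black and one all-white image.
toy = pd.DataFrame({
    "emotion": [3, 6],
    "pixels": [" ".join(["0"] * 2304), " ".join(["255"] * 2304)],
})
X = to_normalized_array(toy)
y = toy["emotion"].to_numpy()
print(X.shape, X.dtype, X.min(), X.max(), y)
```

With the real data, `X` and `y` would then be the image array and label vector handed to the model.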