# Flickr30k dataset analysis

### 0. Library and modules imports

In [1]:
import datacleaner
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import dask.dataframe as dd

### 1. Load the dataset (we will add a proper download link once the dataset is ready)

In [2]:
df = pd.read_csv('flickr/results.csv', sep = '|')
df.columns = ['image_name', 'comment_number', 'comment']

In [3]:
df.head()

Unnamed: 0,image_name,comment_number,comment
0,1000092795.jpg,0,Two young guys with shaggy hair look at their...
1,1000092795.jpg,1,"Two young , White males are outside near many..."
2,1000092795.jpg,2,Two men in green shirts are standing in a yard .
3,1000092795.jpg,3,A man in a blue shirt standing in a garden .
4,1000092795.jpg,4,Two friends enjoy time spent together .


### 2. Dataset analysis

In [4]:
number_of_images = len(df.image_name.unique())
col_types = df.dtypes
print('The dataset contains {} images'.format(number_of_images))
print('The column types are: ') 
print(col_types)


The dataset contains 31783 images
The column types are: 
image_name        object
comment_number    object
comment           object
dtype: object


In [5]:
df.image_name.value_counts().unique()

array([5])

Every image has exactly five captions, so the dataset is consistent in this respect. The next step is to strip non-alphanumeric characters from the captions, but first we check that every caption is actually a string.

In [6]:
# Remember the index of the last row whose comment is not a string
res = 0
for index, elem in df.iterrows():
    if not isinstance(elem.comment, str):
        res = index

In [7]:
df.iloc[res]

image_name                            2199200615.jpg
comment_number     4   A dog runs across the grass .
comment                                          NaN
Name: 19999, dtype: object

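This row is malformed: the separator between the comment number and the caption is missing, so the caption ended up in comment_number and comment was parsed as NaN, which pandas stores as a float.

We remove this image, together with all of its captions, from the dataset.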
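
The same check can be sketched without an explicit loop by flagging every comment that is not a string. The sample below is a small in-memory stand-in for results.csv; the file contents and the names `df_sample`, `bad_rows`, `bad_images` and `df_clean` are illustrative assumptions, not part of the original notebook.

```python
import io
import pandas as pd

# Illustrative sample (assumption): the last row lacks the second '|' separator,
# so its caption lands in comment_number and comment is parsed as NaN.
sample = io.StringIO(
    "image_name|comment_number|comment\n"
    "a.jpg|0|Two men stand in a yard .\n"
    "a.jpg|1|A man in a garden .\n"
    "b.jpg|0|A dog plays outside .\n"
    "b.jpg|1   A dog runs across the grass .\n"
)
df_sample = pd.read_csv(sample, sep="|")

# Flag rows whose comment is not a string (NaN is a float).
is_text = df_sample["comment"].map(lambda c: isinstance(c, str))
bad_rows = df_sample[~is_text]

# Drop every caption that belongs to an affected image.
bad_images = bad_rows["image_name"].unique()
df_clean = df_sample[~df_sample["image_name"].isin(bad_images)]

print(bad_rows)
print(len(df_clean))
```

Unlike the loop, this approach reports every malformed row rather than only the last one, which may matter if a dataset contains more than one such row.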

In [8]:
df.drop(df.loc[df.image_name == df.image_name.iloc[res]].index, inplace = True)

In [9]:
df_cleaned = df.copy()
# Strip punctuation, lowercase, tokenize on whitespace, then remove stop words
df_cleaned['cleaned_captions'] = df_cleaned.comment.apply(datacleaner.remove_non_alphanumeric)
df_cleaned['cleaned_captions'] = df_cleaned['cleaned_captions'].apply(datacleaner.lower_caption)
df_cleaned['tokenized'] = df_cleaned.cleaned_captions.apply(lambda x: x.split())
df_cleaned['no_stopwords'] = df_cleaned.tokenized.apply(datacleaner.delete_stopwords)

In [10]:
df_cleaned.tokenized

0         [two, young, guys, with, shaggy, hair, look, a...
1         [two, young, white, males, are, outside, near,...
2         [two, men, in, green, shirts, are, standing, i...
3         [a, man, in, a, blue, shirt, standing, in, a, ...
4              [two, friends, enjoy, time, spent, together]
                                ...                        
158910    [a, man, in, shorts, and, a, hawaiian, shirt, ...
158911    [a, young, man, hanging, over, the, side, of, ...
158912    [a, man, is, leaning, off, of, the, side, of, ...
158913    [a, man, riding, a, small, boat, in, a, harbor...
158914    [a, man, on, a, moored, blue, and, white, boat...
Name: tokenized, Length: 158910, dtype: object

In [11]:
df_cleaned.no_stopwords

0         [two, young, guys, shaggy, hair, look, hands, ...
1         [two, young, white, males, outside, near, many...
2                 [two, men, green, shirts, standing, yard]
3                      [man, blue, shirt, standing, garden]
4              [two, friends, enjoy, time, spent, together]
                                ...                        
158910    [man, shorts, hawaiian, shirt, leans, rail, pi...
158911    [young, man, hanging, side, boat, like, fog, r...
158912    [man, leaning, side, blue, white, boat, sits, ...
158913    [man, riding, small, boat, harbor, fog, mounta...
158914    [man, moored, blue, white, boat, hills, mist, ...
Name: no_stopwords, Length: 158910, dtype: object

We applied some light preprocessing to the captions to ensure a proper analysis.
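
The datacleaner module is not shown in this notebook, so the following is only a rough sketch of what its three helpers might do. The function bodies and the tiny `STOPWORDS` set are assumptions made for illustration; the real module may behave differently (for example, it may use a much larger stop-word list).

```python
import re

# Hypothetical stop-word list (assumption); a real list would be much longer.
STOPWORDS = {"a", "an", "the", "in", "on", "of", "with", "are", "is", "and", "at"}

def remove_non_alphanumeric(caption: str) -> str:
    """Replace every run of characters other than letters, digits and spaces with a space."""
    return re.sub(r"[^A-Za-z0-9 ]+", " ", caption)

def lower_caption(caption: str) -> str:
    """Lowercase the whole caption."""
    return caption.lower()

def delete_stopwords(tokens):
    """Keep only the tokens that are not in the stop-word set."""
    return [t for t in tokens if t not in STOPWORDS]

caption = "Two men in green shirts are standing in a yard ."
cleaned = lower_caption(remove_non_alphanumeric(caption))
tokens = cleaned.split()
no_stopwords = delete_stopwords(tokens)

print(tokens)
print(no_stopwords)
```

For this caption the sketch yields the same tokens as rows 2 of the tokenized and no_stopwords columns above, which suggests the assumed behaviour is at least close to the original helpers.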

We will now build a bag-of-words representation of the captions.

In [12]:
bow = CountVectorizer(stop_words='english', lowercase=True, min_df=5)
X = bow.fit_transform(df_cleaned.cleaned_captions).toarray()

# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is available from version 1.0 onwards
df_bow = pd.DataFrame(X, columns=bow.get_feature_names_out())
len(df_bow.columns)

We have 7453 words that appear in at least 5 captions (min_df = 5) once English stop words are removed.
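
To make the meaning of min_df concrete, here is a minimal sketch on a toy corpus. The four captions and the threshold of 2 are illustrative assumptions. The sketch also keeps the result as a sparse matrix, which may be preferable at full scale, since a dense array of 158910 captions by 7453 words is large.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (assumption), standing in for the cleaned captions.
captions = [
    "a dog runs across the grass",
    "a dog jumps over a log",
    "two men stand in the grass",
    "a man rides a boat",
]

# min_df=2 keeps only terms that occur in at least two different captions
# (document frequency), not terms that occur at least twice overall.
bow = CountVectorizer(stop_words="english", min_df=2)
X = bow.fit_transform(captions)  # sparse matrix, one row per caption

# vocabulary_ maps each kept term to its column index.
vocab = sorted(bow.vocabulary_)

# Total count of each kept term across all captions.
totals = np.asarray(X.sum(axis=0)).ravel()

print(vocab)
print(X.shape)
print(totals)
```

Only "dog" and "grass" survive here, because every other non-stop word appears in a single caption.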