# Smartathon - Exploratory Data Analysis

To begin, we import the following libraries: 
- NumPy: for numerical computing
- Pandas: for data manipulation
- mlxtend: for association rule mining

In [1]:
# importing the libraries
import numpy as np
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

Next, we read the dataset using the read_csv() function from the Pandas library.

In [2]:
# reading the dataset
df = pd.read_csv('/kaggle/input/smartathon-dataset-zip/dataset/train.csv')

To check the structure of the dataset, we view its first few rows using the head() function from the Pandas library.

In [3]:
# viewing the first few rows
df.head()

Unnamed: 0,class,image_path,name,xmax,xmin,ymax,ymin
0,3.0,4a48c42c9579ec0399e6c5a3e825e765.jpg,GARBAGE,797.0,701.0,262.0,211.0
1,3.0,4a48c42c9579ec0399e6c5a3e825e765.jpg,GARBAGE,932.0,786.0,329.0,238.0
2,3.0,4a48c42c9579ec0399e6c5a3e825e765.jpg,GARBAGE,736.0,657.0,275.0,229.0
3,7.0,ea906a663da6321bcef78be4b7d1afff.jpg,BAD_BILLBOARD,986.0,786.0,136.0,0.0
4,8.0,1c7d48005a12d1b19261b8e71df7cafe.jpg,SAND_ON_ROAD,667.0,549.0,228.0,179.0


Let's check which class of visual pollution is the most prevalent. To do that, we call the value_counts() function from the Pandas library on the 'name' column of the dataframe.

In [4]:
# checking which class of visual pollution is the most prevalent
df.name.value_counts()

GARBAGE              8597
CONSTRUCTION_ROAD    2730
POTHOLES             2625
CLUTTER_SIDEWALK     2253
BAD_BILLBOARD        1555
GRAFFITI             1124
SAND_ON_ROAD          748
UNKEPT_FACADE         127
FADED_SIGNAGE         107
BROKEN_SIGNAGE         83
BAD_STREETLIGHT         1
Name: name, dtype: int64

As we can see, the most prevalent type of visual pollution is GARBAGE, followed by CONSTRUCTION_ROAD.
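Raw counts do not show how skewed the classes are. As a minimal sketch, the shares below are computed from the counts printed above rather than from the file itself; on the real dataframe, `df.name.value_counts(normalize=True)` should give the same proportions directly. The 1% cut-off for "rare" classes is an arbitrary choice made for illustration.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({
    'GARBAGE': 8597,
    'CONSTRUCTION_ROAD': 2730,
    'POTHOLES': 2625,
    'CLUTTER_SIDEWALK': 2253,
    'BAD_BILLBOARD': 1555,
    'GRAFFITI': 1124,
    'SAND_ON_ROAD': 748,
    'UNKEPT_FACADE': 127,
    'FADED_SIGNAGE': 107,
    'BROKEN_SIGNAGE': 83,
    'BAD_STREETLIGHT': 1,
})

# Share of all annotations that belongs to each class.
raw_shares = counts / counts.sum()
shares = raw_shares.round(4)
print(shares)

# Classes that make up less than 1% of the annotations (illustrative threshold).
rare = raw_shares[raw_shares < 0.01].index.tolist()
print('Rare classes:', rare)
```

GARBAGE alone accounts for roughly 43% of all annotations, which is worth keeping in mind when reading the co-occurrence results later on.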

Now let's check whether GARBAGE occurs more often on its own or together with other categories. To do that, we count the images in which GARBAGE occurs individually and the images in which it is associated with another category.

In [5]:
# number of annotations of each class in each image, indexed by (image_path, name)
per_image_counts = df.groupby(by='image_path').name.value_counts()

# NOTE: this counts images with exactly one GARBAGE annotation,
# not images in which GARBAGE is the only class
sg = 0
for k in per_image_counts.loc[lambda x: x==1].keys():
    if k[1]=='GARBAGE':
        sg = sg+1

# NOTE: this counts images with more than one GARBAGE annotation,
# not images in which GARBAGE co-occurs with another class
wg = 0
for k in per_image_counts.loc[lambda x: x>1].keys():
    if k[1]=='GARBAGE':
        wg = wg+1

print('Only Garbage: ', sg)
print('Garbage with Other Categories: ', wg)

Only Garbage:  1787
Garbage with Other Categories:  2038


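As we can see, the second count is the larger one. Note, however, that the code above filters on the number of GARBAGE annotations per image (exactly one versus more than one), so these figures describe how often GARBAGE is annotated once or several times in an image rather than whether it appears with other categories. To find out which categories actually co-occur with GARBAGE, we use Association Rule Mining.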
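To count at the image level rather than at the annotation level, one option is to collect the set of distinct classes in each image and test that set. The sketch below runs on a small made-up dataframe (the file names are hypothetical) with the same `image_path` and `name` columns as the dataset; it is an illustration, not the computation that produced the figures above.

```python
import pandas as pd

# Toy annotations: one row per bounding box.
toy = pd.DataFrame({
    'image_path': ['a.jpg', 'a.jpg', 'b.jpg', 'b.jpg', 'c.jpg', 'd.jpg'],
    'name': ['GARBAGE', 'GARBAGE', 'GARBAGE', 'POTHOLES', 'POTHOLES', 'GARBAGE'],
})

# Set of distinct classes present in each image.
classes_per_image = toy.groupby('image_path')['name'].apply(set)

has_garbage = classes_per_image.apply(lambda s: 'GARBAGE' in s)
only_garbage = classes_per_image.apply(lambda s: s == {'GARBAGE'})

n_only = int(only_garbage.sum())
n_with_others = int((has_garbage & ~only_garbage).sum())

print('Only Garbage:', n_only)
print('Garbage with Other Categories:', n_with_others)
```

Here `a.jpg` contains two GARBAGE boxes and nothing else, so it counts as garbage-only even though its annotation count is greater than one.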

In [6]:
# finding categories most commonly associated with garbage
arm = []
for i in list(df.image_path.unique()):
    arm.append(list(df.loc[df.image_path==i, 'name'].unique()))
encoder = TransactionEncoder()
transactions = encoder.fit(arm).transform(arm)
transactions = transactions.astype('int')
arm = pd.DataFrame(transactions, columns=encoder.columns_)
frequent_itemsets = apriori(arm, use_colnames=True, min_support=0.01)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
print(frequent_itemsets[frequent_itemsets.length>1].sort_values(by='support', ascending=False))

     support                      itemsets  length
9   0.049022   (GARBAGE, CLUTTER_SIDEWALK)       2
12  0.026543           (GARBAGE, POTHOLES)       2
13  0.023241       (GARBAGE, SAND_ON_ROAD)       2
8   0.022733      (GARBAGE, BAD_BILLBOARD)       2
11  0.018542           (GARBAGE, GRAFFITI)       2
10  0.018415  (GARBAGE, CONSTRUCTION_ROAD)       2




As we can see, the category that most often co-occurs with GARBAGE is CLUTTER_SIDEWALK: the two appear together in about 4.9% of the images.
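Support only says how often two categories appear together; it does not say whether that is more often than chance would suggest given how common each category is. Confidence and lift address this, and the `association_rules` function imported earlier reports them for frequent itemsets. The sketch below computes the three measures by hand on a made-up dataframe so that the definitions are visible. The helper name `pair_metrics` and the toy data are assumptions introduced here.

```python
import pandas as pd

# Toy annotations with hypothetical image names.
toy = pd.DataFrame({
    'image_path': ['a', 'a', 'b', 'b', 'c', 'd', 'e'],
    'name': ['GARBAGE', 'CLUTTER_SIDEWALK', 'GARBAGE', 'POTHOLES',
             'GARBAGE', 'POTHOLES', 'CLUTTER_SIDEWALK'],
})

# One row per image, one boolean column per class.
onehot = pd.crosstab(toy['image_path'], toy['name']) > 0


def pair_metrics(onehot, a, b):
    """Support, confidence and lift of the rule a -> b."""
    support_a = onehot[a].mean()
    support_b = onehot[b].mean()
    support_ab = (onehot[a] & onehot[b]).mean()
    confidence = support_ab / support_a   # P(b | a)
    lift = confidence / support_b         # > 1 means a and b attract each other
    return support_ab, confidence, lift


support, confidence, lift = pair_metrics(onehot, 'GARBAGE', 'CLUTTER_SIDEWALK')
print(round(support, 3), round(confidence, 3), round(lift, 3))
```

A boolean frame of this shape (images as rows, classes as columns) is also a valid input to `apriori`, so `pd.crosstab` can stand in for the explicit loop over images when building the transactions.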

Next, we assume that the image file names encode the city in which each image was captured. To work with this assumption, we extract the first character of image_path and arbitrarily assign a city name to each of the 16 possible characters.

In [7]:
# extracting the first character
df['st'] = df.image_path.apply(lambda x: x[0])

# creating a dictionary of the first character and allocating them to cities
cities = {
    '8' : 'Riyadh',
    'e' : 'Jubail',
    'b' : 'Jeddah',
    'f' : 'Al Khubar',
    'd' : 'Dammam',
    '9' : 'Ghran',
    'a' : 'Madina',
    '5' : 'Tabuk',
    '1' : 'Taif',
    'c' : 'Al Jubail',
    '3' : 'Hail',
    '7' : 'Najran',
    '6' : 'Al Hofuf',
    '2' : 'Khamis Mushait',
    '0' : 'Dhahran',
    '4' : 'Yanbu'
}

# mapping cities to image path
def city_map(x):
    return cities[x]
df['location'] = df['st'].apply(city_map)
df = df.drop(columns='st')
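The same mapping can be written without the helper column and function by combining `.str[0]` with `Series.map`. Unlike a plain dictionary lookup, `map` leaves `NaN` for characters missing from the dictionary instead of raising a `KeyError`, which makes unmapped file names easy to list. The sketch below uses a two-entry excerpt of the dictionary and made-up file names.

```python
import pandas as pd

# Two entries taken from the `cities` dictionary above; the file names are made up.
toy_cities = {'4': 'Yanbu', 'e': 'Jubail'}
toy = pd.DataFrame({'image_path': ['4a48.jpg', 'ea90.jpg', 'zz00.jpg']})

# First character of the file name, looked up in the dictionary.
toy['location'] = toy['image_path'].str[0].map(toy_cities)

# File names whose first character has no city assigned.
unmapped = toy.loc[toy['location'].isna(), 'image_path'].tolist()
print(toy)
print('Unmapped:', unmapped)
```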

Next, we check which city has the highest prevalence of pollution. To do this, we use the value_counts() function from the Pandas library.

In [8]:
# checking location-wise prevalence of pollution
df.location.value_counts()

Riyadh            1417
Jubail            1387
Jeddah            1341
Al Khubar         1326
Dammam            1279
Ghran             1271
Madina            1260
Tabuk             1233
Taif              1231
Al Jubail         1222
Hail              1209
Najran            1200
Al Hofuf          1158
Khamis Mushait    1149
Dhahran           1145
Yanbu             1122
Name: location, dtype: int64

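As we can see, under our assumption, Riyadh has the most annotated instances of pollution (1417), although the counts are spread fairly evenly across locations, ranging from 1122 to 1417.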
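These are counts of annotations, so a location with more images will tend to rank higher regardless of how polluted each scene is. One way to separate the two effects, sketched here on made-up data with the same `image_path` and `location` columns, is to report the number of distinct images alongside the number of annotations.

```python
import pandas as pd

# Toy data: Riyadh has 4 annotations over 2 images, Yanbu has 2 over 2.
toy = pd.DataFrame({
    'image_path': ['a.jpg', 'a.jpg', 'a.jpg', 'b.jpg', 'c.jpg', 'd.jpg'],
    'location': ['Riyadh', 'Riyadh', 'Riyadh', 'Riyadh', 'Yanbu', 'Yanbu'],
})

summary = toy.groupby('location').agg(
    annotations=('image_path', 'count'),   # rows, i.e. bounding boxes
    images=('image_path', 'nunique'),      # distinct image files
)
summary['annotations_per_image'] = summary['annotations'] / summary['images']
print(summary)
```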

Next, we check the location-wise prevalence of each type of pollution. To do this, we create a contingency table using the crosstab() function from the Pandas library.

In [9]:
# checking the location-wise prevalence of types of pollution
pd.crosstab(df['location'], df['name'])

location,BAD_BILLBOARD,BAD_STREETLIGHT,BROKEN_SIGNAGE,CLUTTER_SIDEWALK,CONSTRUCTION_ROAD,FADED_SIGNAGE,GARBAGE,GRAFFITI,POTHOLES,SAND_ON_ROAD,UNKEPT_FACADE
Al Hofuf,78,0,5,141,177,8,500,57,157,29,6
Al Jubail,107,0,1,139,147,7,541,59,181,38,2
Al Khubar,98,0,0,155,179,4,580,76,171,54,9
Dammam,80,0,5,180,188,4,559,78,138,41,6
Dhahran,83,0,9,125,158,6,464,81,174,39,6
Ghran,108,0,4,138,193,10,519,48,193,51,7
Hail,114,0,6,132,135,6,550,63,155,36,12
Jeddah,103,0,1,138,170,11,628,86,142,48,14
Jubail,116,0,6,201,186,4,597,55,163,50,9
Khamis Mushait,100,0,0,119,148,8,516,65,147,40,6
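Because the locations have different totals, raw counts are hard to compare across rows. `pd.crosstab` accepts a `normalize` argument; with `normalize='index'` each row is divided by its own total, giving the share of each pollution type within a location. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy annotations: Jeddah has 3 GARBAGE and 1 POTHOLES, Hail has one of each.
toy = pd.DataFrame({
    'location': ['Jeddah', 'Jeddah', 'Jeddah', 'Jeddah', 'Hail', 'Hail'],
    'name': ['GARBAGE', 'GARBAGE', 'GARBAGE', 'POTHOLES', 'GARBAGE', 'POTHOLES'],
})

# Row-wise shares: every row sums to 1.
shares = pd.crosstab(toy['location'], toy['name'], normalize='index')
print(shares)
```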


Next, we repeat the garbage counts and the association rule mining separately for each location.

In [10]:
for lc in df.location.unique():
    print('Location: ', lc)
    
    # counting images at this location with exactly one GARBAGE annotation
    sg = 0
    for k in list(df[df.location==lc].groupby(by='image_path').name.value_counts().loc[lambda x: x==1].keys()):
        if k[1]=='GARBAGE':
            sg = sg+1
    # counting images at this location with more than one GARBAGE annotation
    wg = 0
    for k in list(df[df.location==lc].groupby(by='image_path').name.value_counts().loc[lambda x: x>1].keys()):
        if k[1]=='GARBAGE':
            wg = wg+1
    print('Only Garbage: ', sg)
    print('Garbage with Other Categories: ', wg)
    print('Most Frequently Occuring Itemsets: ')
    
    # finding categories most commonly associated with garbage
    arm = []
    for i in list(df[df.location==lc].image_path.unique()):
        arm.append(list(df[df.location==lc].loc[df.image_path==i, 'name'].unique()))
    encoder = TransactionEncoder()
    transactions = encoder.fit(arm).transform(arm)
    transactions = transactions.astype('int')
    arm = pd.DataFrame(transactions, columns=encoder.columns_)
    frequent_itemsets = apriori(arm, use_colnames=True, min_support=0.01)
    frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
    print(frequent_itemsets[frequent_itemsets.length>1])

Location:  Yanbu
Only Garbage:  108
Garbage with Other Categories:  115
Most Frequently Occuring Itemsets: 




     support                            itemsets  length
7   0.010870  (BAD_BILLBOARD, CONSTRUCTION_ROAD)       2
8   0.023913            (GARBAGE, BAD_BILLBOARD)       2
9   0.039130         (GARBAGE, CLUTTER_SIDEWALK)       2
10  0.015217                 (GARBAGE, GRAFFITI)       2
11  0.030435                 (GARBAGE, POTHOLES)       2
12  0.026087             (GARBAGE, SAND_ON_ROAD)       2
Location:  Jubail
Only Garbage:  139
Garbage with Other Categories:  137
Most Frequently Occuring Itemsets: 




     support                      itemsets  length
8   0.031776      (GARBAGE, BAD_BILLBOARD)       2
9   0.067290   (GARBAGE, CLUTTER_SIDEWALK)       2
10  0.022430  (GARBAGE, CONSTRUCTION_ROAD)       2
11  0.014953           (GARBAGE, GRAFFITI)       2
12  0.024299           (GARBAGE, POTHOLES)       2
13  0.031776       (GARBAGE, SAND_ON_ROAD)       2
Location:  Taif
Only Garbage:  124
Garbage with Other Categories:  124
Most Frequently Occuring Itemsets: 




     support                      itemsets  length
8   0.014141      (GARBAGE, BAD_BILLBOARD)       2
9   0.058586   (GARBAGE, CLUTTER_SIDEWALK)       2
10  0.022222  (GARBAGE, CONSTRUCTION_ROAD)       2
11  0.030303           (GARBAGE, GRAFFITI)       2
12  0.020202           (GARBAGE, POTHOLES)       2
13  0.020202       (GARBAGE, SAND_ON_ROAD)       2
Location:  Riyadh
Only Garbage:  121
Garbage with Other Categories:  142
Most Frequently Occuring Itemsets: 




     support                            itemsets  length
8   0.018149  (BAD_BILLBOARD, CONSTRUCTION_ROAD)       2
9   0.041742            (GARBAGE, BAD_BILLBOARD)       2
10  0.032668         (GARBAGE, CLUTTER_SIDEWALK)       2
11  0.018149        (GARBAGE, CONSTRUCTION_ROAD)       2
12  0.014519                 (GARBAGE, GRAFFITI)       2
13  0.027223                 (GARBAGE, POTHOLES)       2
14  0.019964             (GARBAGE, SAND_ON_ROAD)       2
Location:  Al Jubail
Only Garbage:  111
Garbage with Other Categories:  119
Most Frequently Occuring Itemsets: 




     support                            itemsets  length
8   0.010593  (BAD_BILLBOARD, CONSTRUCTION_ROAD)       2
9   0.010593            (GARBAGE, BAD_BILLBOARD)       2
10  0.050847         (GARBAGE, CLUTTER_SIDEWALK)       2
11  0.019068        (GARBAGE, CONSTRUCTION_ROAD)       2
12  0.012712                 (GARBAGE, GRAFFITI)       2
13  0.036017                 (GARBAGE, POTHOLES)       2
14  0.012712             (GARBAGE, SAND_ON_ROAD)       2
Location:  Najran
Only Garbage:  108
Garbage with Other Categories:  121
Most Frequently Occuring Itemsets: 




     support                            itemsets  length
10  0.010438  (BAD_BILLBOARD, CONSTRUCTION_ROAD)       2
11  0.022965            (GARBAGE, BAD_BILLBOARD)       2
12  0.035491         (GARBAGE, CLUTTER_SIDEWALK)       2
13  0.027140        (GARBAGE, CONSTRUCTION_ROAD)       2
14  0.022965                 (GARBAGE, GRAFFITI)       2
15  0.035491                 (GARBAGE, POTHOLES)       2
16  0.025052             (GARBAGE, SAND_ON_ROAD)       2
17  0.010438                (GRAFFITI, POTHOLES)       2
Location:  Al Khubar
Only Garbage:  112
Garbage with Other Categories:  137
Most Frequently Occuring Itemsets: 




     support                      itemsets  length
7   0.017682      (GARBAGE, BAD_BILLBOARD)       2
8   0.045187   (GARBAGE, CLUTTER_SIDEWALK)       2
9   0.029470  (GARBAGE, CONSTRUCTION_ROAD)       2
10  0.021611           (GARBAGE, GRAFFITI)       2
11  0.025540           (GARBAGE, POTHOLES)       2
12  0.029470       (GARBAGE, SAND_ON_ROAD)       2
13  0.011788      (POTHOLES, SAND_ON_ROAD)       2
Location:  Jeddah
Only Garbage:  116
Garbage with Other Categories:  147
Most Frequently Occuring Itemsets: 




     support                            itemsets  length
9   0.011407  (BAD_BILLBOARD, CONSTRUCTION_ROAD)       2
10  0.022814            (GARBAGE, BAD_BILLBOARD)       2
11  0.053232         (GARBAGE, CLUTTER_SIDEWALK)       2
12  0.011407        (GARBAGE, CONSTRUCTION_ROAD)       2
13  0.015209                 (GARBAGE, GRAFFITI)       2
14  0.020913                 (GARBAGE, POTHOLES)       2
15  0.028517             (GARBAGE, SAND_ON_ROAD)       2
Location:  Ghran
Only Garbage:  103
Garbage with Other Categories:  126
Most Frequently Occuring Itemsets: 




     support                            itemsets  length
8   0.010225  (BAD_BILLBOARD, CONSTRUCTION_ROAD)       2
9   0.020450            (GARBAGE, BAD_BILLBOARD)       2
10  0.044990         (GARBAGE, CLUTTER_SIDEWALK)       2
11  0.016360        (GARBAGE, CONSTRUCTION_ROAD)       2
12  0.034765                 (GARBAGE, POTHOLES)       2
13  0.018405             (GARBAGE, SAND_ON_ROAD)       2
14  0.010225            (POTHOLES, SAND_ON_ROAD)       2
Location:  Al Hofuf
Only Garbage:  95
Garbage with Other Categories:  123
Most Frequently Occuring Itemsets: 




     support                      itemsets  length
9   0.019523      (GARBAGE, BAD_BILLBOARD)       2
10  0.065076   (GARBAGE, CLUTTER_SIDEWALK)       2
11  0.013015  (GARBAGE, CONSTRUCTION_ROAD)       2
12  0.017354           (GARBAGE, GRAFFITI)       2
13  0.034707           (GARBAGE, POTHOLES)       2
14  0.023861       (GARBAGE, SAND_ON_ROAD)       2
Location:  Hail
Only Garbage:  130
Garbage with Other Categories:  136
Most Frequently Occuring Itemsets: 




     support                      itemsets  length
9   0.025641      (GARBAGE, BAD_BILLBOARD)       2
10  0.051282   (GARBAGE, CLUTTER_SIDEWALK)       2
11  0.015779  (GARBAGE, CONSTRUCTION_ROAD)       2
12  0.025641           (GARBAGE, GRAFFITI)       2
13  0.025641           (GARBAGE, POTHOLES)       2
14  0.023669       (GARBAGE, SAND_ON_ROAD)       2
Location:  Dammam
Only Garbage:  114
Garbage with Other Categories:  140
Most Frequently Occuring Itemsets: 




     support                       itemsets  length
7   0.013540       (GARBAGE, BAD_BILLBOARD)       2
8   0.054159    (GARBAGE, CLUTTER_SIDEWALK)       2
9   0.013540   (GARBAGE, CONSTRUCTION_ROAD)       2
10  0.011605  (POTHOLES, CONSTRUCTION_ROAD)       2
11  0.019342            (GARBAGE, GRAFFITI)       2
12  0.013540            (GARBAGE, POTHOLES)       2
13  0.023211        (GARBAGE, SAND_ON_ROAD)       2
Location:  Madina
Only Garbage:  106
Garbage with Other Categories:  116
Most Frequently Occuring Itemsets: 




     support                      itemsets  length
9   0.026369      (GARBAGE, BAD_BILLBOARD)       2
10  0.032454   (GARBAGE, CLUTTER_SIDEWALK)       2
11  0.026369  (GARBAGE, CONSTRUCTION_ROAD)       2
12  0.022312           (GARBAGE, GRAFFITI)       2
13  0.016227           (GARBAGE, POTHOLES)       2
14  0.018256       (GARBAGE, SAND_ON_ROAD)       2
Location:  Khamis Mushait
Only Garbage:  88
Garbage with Other Categories:  118
Most Frequently Occuring Itemsets: 




     support                      itemsets  length
8   0.018018      (GARBAGE, BAD_BILLBOARD)       2
9   0.040541   (GARBAGE, CLUTTER_SIDEWALK)       2
10  0.015766  (GARBAGE, CONSTRUCTION_ROAD)       2
11  0.022523           (GARBAGE, GRAFFITI)       2
12  0.027027           (GARBAGE, POTHOLES)       2
13  0.022523       (GARBAGE, SAND_ON_ROAD)       2
Location:  Tabuk
Only Garbage:  109
Garbage with Other Categories:  140
Most Frequently Occuring Itemsets: 




     support                      itemsets  length
10  0.026210      (GARBAGE, BAD_BILLBOARD)       2
11  0.052419   (GARBAGE, CLUTTER_SIDEWALK)       2
12  0.016129  (GARBAGE, CONSTRUCTION_ROAD)       2
13  0.022177           (GARBAGE, GRAFFITI)       2
14  0.020161           (GARBAGE, POTHOLES)       2
15  0.028226       (GARBAGE, SAND_ON_ROAD)       2
Location:  Dhahran
Only Garbage:  103
Garbage with Other Categories:  97
Most Frequently Occuring Itemsets: 
     support                      itemsets  length
9   0.025000      (GARBAGE, BAD_BILLBOARD)       2
10  0.061364   (GARBAGE, CLUTTER_SIDEWALK)       2
11  0.020455  (GARBAGE, CONSTRUCTION_ROAD)       2
12  0.013636           (GARBAGE, GRAFFITI)       2
13  0.036364           (GARBAGE, POTHOLES)       2
14  0.018182       (GARBAGE, SAND_ON_ROAD)       2
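The per-location tables are long, and in most of them the pair with the highest support is (GARBAGE, CLUTTER_SIDEWALK). A compact way to summarise this is to compute, for each location, the support of every (GARBAGE, other) pair and keep the largest. The sketch below does so with `pd.crosstab` and `groupby` on made-up data; the helper name `garbage_partners` is an assumption introduced here, and the result covers pairs only, not larger itemsets.

```python
import pandas as pd


def garbage_partners(frame, target='GARBAGE'):
    """Support of each (target, other) pair among the images in `frame`."""
    onehot = pd.crosstab(frame['image_path'], frame['name']) > 0
    if target not in onehot.columns:
        return pd.Series(dtype=float)
    others = onehot.drop(columns=target)
    # Fraction of images that contain both the target and each other class.
    support = others[onehot[target]].sum() / len(onehot)
    return support.sort_values(ascending=False)


# Toy annotations with hypothetical image names.
toy = pd.DataFrame({
    'image_path': ['a', 'a', 'b', 'b', 'c', 'd', 'd'],
    'name': ['GARBAGE', 'POTHOLES', 'GARBAGE', 'POTHOLES',
             'GARBAGE', 'GARBAGE', 'GRAFFITI'],
    'location': ['Taif', 'Taif', 'Taif', 'Taif', 'Taif', 'Hail', 'Hail'],
})

# Most frequent partner of GARBAGE in each location.
top_partner = {
    lc: garbage_partners(group).idxmax()
    for lc, group in toy.groupby('location')
}
print(top_partner)
```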


