# IFN645 Case Study 2
## Mining from Manufacturing, Supermarket, News Stories and Web Log Data

### Contents
1. [Clustering & Pre-processing](#clust)
2. [Association Mining](#association)
3. [Text Mining](#text)
4. [Web Mining](#web)

---
## Part 1: Clustering Pre-processing and K-means analysis<a name="clust"></a>
### 1. Can you identify data quality issues in this dataset such as unusual data types, missing values, etc?
After importing the data, the `DataFrame.info()` method gives a first overview of each column's datatype and non-null count.

In [12]:
import pandas as pd
# Import data from CSV; na_filter=False keeps empty cells as empty strings instead of NaN
df = pd.read_csv('Casestudy2-Data-Py/model_car_sales.csv', na_filter=False)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 675 entries, 0 to 674
Data columns (total 8 columns):
LOCATION_NUMBER    675 non-null int64
REPORT_DATE        675 non-null object
DEALER_CODE        675 non-null object
UTE                675 non-null object
HATCH              675 non-null object
WAG0N              675 non-null object
SEDAN              675 non-null object
K__SALES_TOT       675 non-null object
dtypes: int64(1), object(7)
memory usage: 42.3+ KB
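
The `object` dtypes above are a likely side effect of `na_filter=False`: blank cells stay as empty strings, so pandas cannot infer a numeric type for those columns. A minimal sketch of that effect, assuming a small made-up in-memory CSV (the rows below are illustrative and not taken from `model_car_sales.csv`):

```python
import io
import pandas as pd

# Hypothetical three-row sample; the second row has a blank UTE cell.
sample_csv = "DEALER_CODE,UTE\nEuro-1,81\nEuro-2,\nEuro-3,92\n"

# na_filter=False keeps the blank cell as '' so the column stays object.
raw = pd.read_csv(io.StringIO(sample_csv), na_filter=False)

# Default parsing turns the blank cell into NaN, giving a float64 column.
parsed = pd.read_csv(io.StringIO(sample_csv))

print(raw['UTE'].dtype, parsed['UTE'].dtype)
```

Reading with `na_filter=False` is still useful here, because it makes the blank cells visible as a distinct value in `describe()` and `value_counts()`.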


From the `info()` output, we can identify the following issues:

#### Unusual Datatypes
<table>
<tr>
<th>Variable Name</th>
<th>Current Datatype</th>
<th>Desired Datatype</th>
</tr>
<tr>
<td>HATCH</td>
<td>Object</td>
<td>int64</td>
</tr>
<tr>
<td>SEDAN</td>
<td>Object</td>
<td>int64</td>
</tr>
<tr>
<td>WAG0N</td>
<td>Object</td>
<td>int64</td>
</tr>
<tr>
<td>UTE</td>
<td>Object</td>
<td>int64</td>
</tr>
<tr>
<td>K__SALES_TOT</td>
<td>Object</td>
<td>int64</td>
</tr>
</table>
     
According to the data description, the fields `UTE`, `HATCH`, `SEDAN`, `WAG0N` and `K__SALES_TOT` should be interval/numerical values rather than objects.
Running `.describe()` on each column helps uncover the source of these issues in the dataset.

In [3]:
# print details for all variables in dataframe
for cols in df:
    print(df[cols].describe())
    print("-"*20)

count    675.0
mean     338.0
std      195.0
min        1.0
25%      169.5
50%      338.0
75%      506.5
max      675.0
Name: LOCATION_NUMBER, dtype: float64
--------------------
count            675
unique             1
top       2013-04-30
freq             675
Name: REPORT_DATE, dtype: object
--------------------
count          675
unique         675
top       Euro-531
freq             1
Name: DEALER_CODE, dtype: object
--------------------
count     675
unique    143
top          
freq       22
Name: UTE, dtype: object
--------------------
count     675
unique    518
top          
freq       22
Name: HATCH, dtype: object
--------------------
count     675
unique    426
top          
freq       22
Name: WAG0N, dtype: object
--------------------
count     675
unique    501
top          
freq       22
Name: SEDAN, dtype: object
--------------------
count     675
unique    109
top       932
freq       25
Name: K__SALES_TOT, dtype: object
--------------------


In [4]:
# Check for cause of issues in one of the variables
print(df['UTE'].value_counts())

       22
81     15
92     15
90     12
80     12
100    11
72     11
106    11
83     10
97     10
84     10
70      9
93      9
69      9
91      9
77      9
82      9
75      9
99      9
116     9
88      9
68      9
73      9
98      8
85      8
74      8
89      8
78      8
66      8
114     8
       ..
157     1
146     1
37      1
131     1
150     1
143     1
155     1
40      1
202     1
190     1
23      1
180     1
39      1
198     1
166     1
41      1
142     1
206     1
46      1
52      1
178     1
191     1
153     1
197     1
8       1
31      1
9       1
196     1
209     1
173     1
Name: UTE, Length: 143, dtype: int64


In [10]:
# Show rows where UTE contains an empty string
# (.values replaces as_matrix(), which was removed in pandas 1.0)
print(df[df['UTE']==''].values)

[[4 '2013-04-30' 'Euro-103' '' '' '' '' '']
 [24 '2013-04-30' 'Euro-123' '' '' '' '' '']
 [50 '2013-04-30' 'Euro-149' '' '' '' '' '']
 [108 '2013-04-30' 'Euro-201' '' '' '' '' '']
 [173 '2013-04-30' 'Euro-260' '' '' '' '' '']
 [174 '2013-04-30' 'Euro-261' '' '' '' '' '']
 [175 '2013-04-30' 'Euro-262' '' '' '' '' '']
 [176 '2013-04-30' 'Euro-263' '' '' '' '' '']
 [177 '2013-04-30' 'Euro-264' '' '' '' '' '']
 [198 '2013-04-30' 'Euro-283' '' '' '' '' '']
 [199 '2013-04-30' 'Euro-284' '' '' '' '' '']
 [200 '2013-04-30' 'Euro-285' '' '' '' '' '']
 [298 '2013-04-30' 'Euro-374' '' '' '' '' '']
 [299 '2013-04-30' 'Euro-375' '' '' '' '' '']
 [300 '2013-04-30' 'Euro-376' '' '' '' '' '']
 [643 '2013-04-30' 'Euro-688' '' '' '' '' '']
 [644 '2013-04-30' 'Euro-689' '' '' '' '' '']
 [645 '2013-04-30' 'Euro-69' '' '' '' '' '']
 [646 '2013-04-30' 'Euro-70' '' '' '' '' '']
 [665 '2013-04-30' 'Euro-89' '' '' '' '' '']
 [666 '2013-04-30' 'Euro-90' '' '' '' '' '']
 [667 '2013-04-30' 'Euro-91' '' '' '' '' '']]
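
The rows printed above are blank in every sales column at once. One way to check this for all columns in a single pass is sketched below, assuming a small hypothetical frame (`demo`, with made-up values) in place of the real `df`:

```python
import pandas as pd

# Hypothetical stand-in for the loaded data; values are illustrative only.
demo = pd.DataFrame({
    'DEALER_CODE': ['Euro-1', 'Euro-2', 'Euro-3'],
    'UTE': ['81', '', '92'],
    'HATCH': ['301', '', '288'],
})

# Boolean mask of cells holding an empty string, then a per-column count.
blank_mask = demo.eq('')
blank_counts = blank_mask.sum()
print(blank_counts)

# Rows where every listed sales column is blank at the same time.
sales_cols = ['UTE', 'HATCH']
fully_blank_rows = demo[blank_mask[sales_cols].all(axis=1)]
print(fully_blank_rows)
```

Applied to the real data, the same pattern would use the five sales columns in `sales_cols`.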

#### Missing Values
Missing values are present in the following variables:
<table>
    <tr>
        <th>Variable Name</th>
        <th># Missing Values</th>
    </tr>
    <tr>
        <td>HATCH</td>
        <td>22</td>
    </tr>
    <tr>
        <td>SEDAN</td>
        <td>22</td>
    </tr>
    <tr>
        <td>WAG0N</td>
        <td>22</td>
    </tr>
    <tr>
        <td>UTE</td>
        <td>22</td>
    </tr>
    <tr>
        <td>K__SALES_TOT</td>
        <td>22</td>
    </tr>
</table>
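
Once the blanks are treated as missing, the affected columns can be converted to a numeric type. The sketch below is one possible approach, not necessarily the one used later in this case study; `demo` is a hypothetical frame with made-up values. Note that `int64` cannot hold `NaN`, so the converted columns come out as `float64` until the missing values are imputed or dropped.

```python
import pandas as pd

# Hypothetical stand-in for the string-typed sales columns.
demo = pd.DataFrame({'UTE': ['81', '', '92'], 'HATCH': ['301', '', '288']})

# errors='coerce' maps anything unparseable, including '', to NaN.
numeric = demo.apply(pd.to_numeric, errors='coerce')

print(numeric.dtypes)        # dtype of each converted column
print(numeric.isna().sum())  # missing values per column after conversion
```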

---
## Part 2: Association Mining and its Data Pre-processing<a name="association"></a>





---
## Part 3: Text Mining<a name="text"></a>





---
## Part 4: Web Mining<a name="web"></a>



