# Collect data

### Changing working directory

- First we need to change the working directory that will be used.
- Running the cell below shows the current directory:

In [2]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\fredd\\Desktop\\Studier\\Project5\\Uboats_in_ww2\\Jupiter_notebooks'

Now we need to change the working directory to the parent of this directory by running the cell below.

In [5]:
os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()

print(f"You are now working in the following directory: {current_dir}")

You are now working in the following directory: c:\Users\fredd\Desktop\Studier
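Re-running the cell above moves the working directory one more level up each time. The sketch below guards against that by only moving up while the session is still inside the notebooks folder. The helper name `move_to_project_root` is hypothetical and not part of the original notebook; the folder name is passed in by the caller.

```python
import os

def move_to_project_root(notebooks_dir_name):
    """Move to the parent directory only if we are still inside the notebooks folder."""
    cwd = os.getcwd()
    # Compare case-insensitively, since Windows paths ignore case
    if os.path.basename(cwd).lower() == notebooks_dir_name.lower():
        os.chdir(os.path.dirname(cwd))
    return os.getcwd()
```

Calling `move_to_project_root("Jupiter_notebooks")` twice in a row should leave you in the same directory after the second call.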




### Collect data from Kaggle
- First get your `kaggle.json` file: click on your profile picture in the top right corner of Kaggle and then on "Settings".
<br><br><img src="../images/screenshots/kaggle.png"><br><br>
- Then press "Create new token" under API.
- When you've downloaded the `kaggle.json` file, drag it into your project's main folder.
<br><br><img src="../images/screenshots/kaggle2.png"><br><br>
- Once that is done, run the cell below so the token is recognized in the session.

In [14]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()

os.getcwd()

'c:\\Users\\fredd\\Desktop\\Studier\\Project 5\\Uboats_in_ww2'
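Before pointing `KAGGLE_CONFIG_DIR` at a folder, it can help to confirm that the token file is really there and looks complete. This is a minimal sketch; the helper name `kaggle_token_ready` is hypothetical, and it assumes the token is a JSON object with `username` and `key` fields.

```python
import json
import os

def kaggle_token_ready(config_dir):
    """Return True if config_dir holds a kaggle.json with 'username' and 'key' fields."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        return False
    try:
        with open(token_path, encoding="utf-8") as fh:
            token = json.load(fh)
    except (OSError, json.JSONDecodeError):
        # Unreadable or malformed file counts as "not ready"
        return False
    return isinstance(token, dict) and {"username", "key"} <= set(token)
```

One way to use it is to set `os.environ['KAGGLE_CONFIG_DIR']` only when `kaggle_token_ready(os.getcwd())` returns `True`, and otherwise re-check where the file was dropped.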

Now let's download the dataset from Kaggle.

I want to work with this dataset: https://www.kaggle.com/datasets/cormac42/ww2-u-boats

- The cell below defines where the dataset will be downloaded to.
- In this case it is `inputs/datasets/raw`, where we store the raw datasets that will be worked on. That way we can start over from the raw files if a problem occurs.

In [15]:
KaggleDatasetPath = "cormac42/ww2-u-boats"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Dataset URL: https://www.kaggle.com/datasets/cormac42/ww2-u-boats
License(s): CC-BY-SA-4.0
Downloading ww2-u-boats.zip to inputs/datasets/raw




  0%|          | 0.00/102k [00:00<?, ?B/s]
100%|██████████| 102k/102k [00:00<00:00, 358kB/s]
100%|██████████| 102k/102k [00:00<00:00, 357kB/s]


Now we need to unzip the data file and delete the zip file and `kaggle.json`, since those files won't be needed anymore.

In [16]:
import zipfile
import glob
import os

DestinationFolder = "inputs/datasets/raw"
KaggleJsonPath = "kaggle.json"

# Find each zip file in DestinationFolder and unpack it
for zip_path in glob.glob(f"{DestinationFolder}/*.zip"):
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(DestinationFolder)
    os.remove(zip_path)  # Removes zip-file after unpacking

if os.path.exists(KaggleJsonPath):
    os.remove(KaggleJsonPath) # This will remove the file
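The same unzip step can be wrapped in a function that reports which files were extracted, so you can confirm the expected CSV arrived before moving on. This is only a sketch; the helper name `extract_zips` and its `remove_zip` parameter are hypothetical.

```python
import glob
import os
import zipfile

def extract_zips(folder, remove_zip=True):
    """Extract every zip archive in folder and return the names of the extracted members."""
    extracted = []
    for zip_path in glob.glob(os.path.join(folder, "*.zip")):
        with zipfile.ZipFile(zip_path, "r") as zip_ref:
            zip_ref.extractall(folder)
            extracted.extend(zip_ref.namelist())
        if remove_zip:
            os.remove(zip_path)  # reached only if extraction did not raise
    return extracted
```

For example, `extract_zips("inputs/datasets/raw")` returns a list of member names that you can check for `uboats.csv`.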

Let's load the data and inspect it.

In [3]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/uboats.csv")
df.head()

Unnamed: 0,Name,Year,Type,Notable Commanders,Warships_sunk_n_total_loss_No,Warships_sunk_n_total_loss_Tons-n-GRT,Warships_Damaged_No,Warships_Damaged_Tons-n-GRT,Merchant_Ships_sunk_No,Merchant_Ships_sunk_GRT,...,Notes,URL,Commissioned,Patrols,Patrols_Count,Wolfpacks,Wolfpacks_Count,Flotilla,Flotilla_Count,Last_Flotilla
0,U-1,1935,IIA,Klaus Ewerth,0,0,0,0,0,0,...,Struck a mine,https://en.wikipedia.org/wiki/German_submarine...,1935-06-29,2 patrols:1st patrol:15 – 29 March 19402nd pat...,2,,0,"{'U-boat School Flotilla': ['1 July 1935 ', ' ...",1,U-boat School Flotilla
1,U-2,1935,IIA,"Hans Heidtmann,Heinrich Liebe,Helmut Rosenbaum...",0,0,0,0,0,0,...,Training boat,https://en.wikipedia.org/wiki/German_submarine...,1935-07-25,2 patrols:1st patrol:15 – 29 March 19402nd pat...,2,,0,"{'U-boat School Flotilla': ['1 July 1935 ', ' ...",2,21st U-boat Flotilla
2,U-3,1935,IIA,"Joachim Schepke,Otto von Bülow,Hans-Hartwig Tr...",0,0,0,0,2,2348,...,,https://en.wikipedia.org/wiki/German_submarine...,1935-09-06,5 patrols:1st patrol:4 – 8 September 19392nd p...,5,,0,"{'U-boat School Flotilla': ['1 August 1935 ', ...",2,21st U-boat Flotilla
3,U-4,1935,IIA,Heinz-Otto Schultze,1,1090,0,0,3,5133,...,,https://en.wikipedia.org/wiki/German_submarine...,1935-08-17,4 patrols:1st patrol:4 – 14 September 19392nd ...,4,,0,"{'U-boat School Flotilla': ['1 August 1935 ', ...",2,21st U-boat Flotilla
4,U-5,1935,IIA,Heinrich Lehmann-Willenbrock,0,0,0,0,0,0,...,Accident,https://en.wikipedia.org/wiki/German_submarine...,1935-08-31,2 patrols:1st patrol:24 August – 8 September 1...,2,,0,{'U-boat School Flotilla': ['1 September 1935 ...,2,21st U-boat Flotilla


Now let's look into the summary of the dataset:

In [18]:
df.info

<bound method DataFrame.info of         Name  Year   Type                                 Notable Commanders  \
0        U-1  1935    IIA                                       Klaus Ewerth   
1        U-2  1935    IIA  Hans Heidtmann,Heinrich Liebe,Helmut Rosenbaum...   
2        U-3  1935    IIA  Joachim Schepke,Otto von Bülow,Hans-Hartwig Tr...   
3        U-4  1935    IIA                                Heinz-Otto Schultze   
4        U-5  1935    IIA                       Heinrich Lehmann-Willenbrock   
...      ...   ...    ...                                                ...   
1148  U-4707  1945  XXIII                                                NaN   
1149  U-4709  1945  XXIII                                                NaN   
1150  U-4710  1945  XXIII                              Ludwig von Friedeburg   
1151  U-4711  1945  XXIII                                                NaN   
1152  U-4712  1945  XXIII                                                NaN   

      W
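Note that `df.info` without parentheses is the method object itself, so the notebook prints the frame's repr rather than a column summary. Calling `df.info()` gives the summary, and `df.shape` gives the row and column counts directly. A small sketch on a toy frame (illustrative data only, not the U-boat dataset):

```python
import io
import pandas as pd

# Toy frame for illustration only; not the U-boat data
toy = pd.DataFrame({"Name": ["U-1", "U-2"], "Year": [1935, 1935]})

buffer = io.StringIO()
toy.info(buf=buffer)          # calling the method writes the summary into the buffer
summary = buffer.getvalue()

n_rows, n_cols = toy.shape    # (rows, columns) as a tuple
print(summary)
print(f"{n_rows} rows, {n_cols} columns")
```

In a notebook, simply running `df.info()` as the last line of a cell prints the same kind of summary without needing the buffer.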

This dataset has 26 columns and 1153 rows with data. <br>
Since U-boat names should be unique, let's check whether there are any duplicates in that column by running the cell below.<br>
Hopefully no duplicates will be shown!

In [19]:
df[df.duplicated(subset=['Name'])]

Unnamed: 0,Name,Year,Type,Notable Commanders,Warships_sunk_n_total_loss_No,Warships_sunk_n_total_loss_Tons-n-GRT,Warships_Damaged_No,Warships_Damaged_Tons-n-GRT,Merchant_Ships_sunk_No,Merchant_Ships_sunk_GRT,...,Notes,URL,Commissioned,Patrols,Patrols_Count,Wolfpacks,Wolfpacks_Count,Flotilla,Flotilla_Count,Last_Flotilla
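An empty result here means no name appears more than once. To see how the check behaves when duplicates do exist, here is a small sketch on a toy frame (made-up rows, not taken from the dataset). By default `duplicated` flags only the later copies; `keep=False` flags every row that shares a name.

```python
import pandas as pd

# Toy frame with one repeated name, for illustration only
toy = pd.DataFrame({
    "Name": ["U-1", "U-2", "U-2", "U-3"],
    "Year": [1935, 1935, 1936, 1935],
})

later_copies = toy[toy.duplicated(subset=["Name"])]            # second and later occurrences
all_copies = toy[toy.duplicated(subset=["Name"], keep=False)]  # every row with a repeated name
names_unique = toy["Name"].is_unique                           # quick yes/no check

print(later_copies)
print(all_copies)
print(names_unique)
```

For a quick yes/no answer on the real data, `df["Name"].is_unique` avoids having to read an empty table.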


In [20]:
df.dtypes

Name                                     object
Year                                      int64
Type                                     object
Notable Commanders                       object
Warships_sunk_n_total_loss_No             int64
Warships_sunk_n_total_loss_Tons-n-GRT    object
Warships_Damaged_No                       int64
Warships_Damaged_Tons-n-GRT              object
Merchant_Ships_sunk_No                    int64
Merchant_Ships_sunk_GRT                  object
Merchant_Ships_damaged_No                 int64
Merchant_Ships_damaged_GRT               object
Merchant_Ships_total_loss_No              int64
Merchant_Ships_total_loss_GRT            object
Fate_Event                               object
Fate_Date                                object
Notes                                    object
URL                                      object
Commissioned                             object
Patrols                                  object
Patrols_Count                           