
In [None]:

# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES
# TO THE CORRECT LOCATION (/kaggle/input) IN YOUR NOTEBOOK,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

import os
import sys
from tempfile import NamedTemporaryFile
from urllib.request import urlopen
from urllib.parse import unquote, urlparse
from urllib.error import HTTPError
from zipfile import ZipFile
import tarfile
import shutil

CHUNK_SIZE = 40960
DATA_SOURCE_MAPPING = 'fatal-police-shootings-in-the-us:https%3A%2F%2Fstorage.googleapis.com%2Fkaggle-data-sets%2F2647%2F4395%2Fbundle%2Farchive.zip%3FX-Goog-Algorithm%3DGOOG4-RSA-SHA256%26X-Goog-Credential%3Dgcp-kaggle-com%2540kaggle-161607.iam.gserviceaccount.com%252F20240909%252Fauto%252Fstorage%252Fgoog4_request%26X-Goog-Date%3D20240909T153452Z%26X-Goog-Expires%3D259200%26X-Goog-SignedHeaders%3Dhost%26X-Goog-Signature%3D4f852067ec4ac0f32d1a1537110251c569a74467ca4ea49d948fbf885da72157ccd5202af3e122e5d4995b052c26660466f491b93b53582a14caed027bc166080aed9c2c0c3b588022505dd6ee7da359d4bffdc75556baf71e713ed122f9f292654ecc8524f07e1940585d503cf50b5cf16d8ab78949db6aca97e9b77d6a8b4776d17ed2cffd0558eb275e803dc3e09e743d7335c21f91068642f1f6414e2cbedbe7bd48482d8801bb8891c4c4fbcc4bd8e54ea8a6d8b6bb7bb187ff6664f9451a0f3f0dd4df93cd70db3890d03c0a5e32d1ac93ca01b6b4773b3c96391c812dee3187e811ccf633ef99fbdec9f29222f433a0d67b4da00a38e9ba657cd71747'

KAGGLE_INPUT_PATH='/kaggle/input'
KAGGLE_WORKING_PATH='/kaggle/working'
KAGGLE_SYMLINK='kaggle'

!umount /kaggle/input/ 2> /dev/null
shutil.rmtree('/kaggle/input', ignore_errors=True)
os.makedirs(KAGGLE_INPUT_PATH, 0o777, exist_ok=True)
os.makedirs(KAGGLE_WORKING_PATH, 0o777, exist_ok=True)

try:
  os.symlink(KAGGLE_INPUT_PATH, os.path.join("..", 'input'), target_is_directory=True)
except FileExistsError:
  pass
try:
  os.symlink(KAGGLE_WORKING_PATH, os.path.join("..", 'working'), target_is_directory=True)
except FileExistsError:
  pass

for data_source_mapping in DATA_SOURCE_MAPPING.split(','):
    directory, download_url_encoded = data_source_mapping.split(':')
    download_url = unquote(download_url_encoded)
    filename = urlparse(download_url).path
    destination_path = os.path.join(KAGGLE_INPUT_PATH, directory)
    try:
        with urlopen(download_url) as fileres, NamedTemporaryFile() as tfile:
            total_length = fileres.headers['content-length']
            print(f'Downloading {directory}, {total_length} bytes compressed')
            dl = 0
            data = fileres.read(CHUNK_SIZE)
            while len(data) > 0:
                dl += len(data)
                tfile.write(data)
                done = int(50 * dl / int(total_length))
                sys.stdout.write(f"\r[{'=' * done}{' ' * (50-done)}] {dl} bytes downloaded")
                sys.stdout.flush()
                data = fileres.read(CHUNK_SIZE)
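            tfile.flush()  # make sure buffered bytes are on disk before the archive is read
            if filename.endswith('.zip'):
              with ZipFile(tfile) as zfile:
                zfile.extractall(destination_path)
            else:
              # use a distinct name so the tarfile module is not rebound
              with tarfile.open(tfile.name) as tar_archive:
                tar_archive.extractall(destination_path)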
            print(f'\nDownloaded and uncompressed: {directory}')
    except HTTPError as e:
        print(f'Failed to load (likely expired) {download_url} to path {destination_path}')
        continue
    except OSError as e:
        print(f'Failed to load {download_url} to path {destination_path}')
        continue

print('Data source import complete.')


**This notebook is an exercise in the [Data Cleaning](https://www.kaggle.com/learn/data-cleaning) course.  You can reference the tutorial at [this link](https://www.kaggle.com/alexisbcook/character-encodings).**

---


In this exercise, you'll apply what you learned in the **Character encodings** tutorial.

# Setup

The questions below will give you feedback on your work. Run the following cell to set up the feedback system.

In [None]:
from learntools.core import binder
binder.bind(globals())
from learntools.data_cleaning.ex4 import *
print("Setup Complete")

# Get our environment set up

The first thing we'll need to do is load in the libraries we'll be using.

In [None]:
# modules we'll use
import pandas as pd
import numpy as np

# helpful character encoding module
import charset_normalizer

# set seed for reproducibility
np.random.seed(0)

# 1) What are encodings?

You're working with a dataset composed of bytes.  Run the code cell below to print a sample entry.

In [None]:
sample_entry = b'\xa7A\xa6n'
print(sample_entry)
print('data type:', type(sample_entry))

You notice that it doesn't use the standard UTF-8 encoding.

Use the next code cell to create a variable `new_entry` that changes the encoding from `"big5-tw"` to `"utf-8"`.  `new_entry` should have the bytes datatype.

In [None]:
new_entry = sample_entry.decode("big5-tw", errors="replace").encode("utf-8")

# Check your answer
q1.check()

In [None]:
# Lines below will give you a hint or solution code
#q1.hint()
#q1.solution()
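
The decode-then-encode pattern from this question can be checked in isolation. The sketch below reuses the bytes from `sample_entry`, shows that they are not valid UTF-8, and confirms that converting to UTF-8 and back loses nothing. The variable names (`raw`, `as_utf8`, `round_trip`) are illustrative only and are not part of the exercise.

```python
# The same bytes as sample_entry in the exercise.
raw = b'\xa7A\xa6n'

# Decoding as UTF-8 fails because 0xa7 cannot start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decode with the source codec, then re-encode as UTF-8.
text = raw.decode("big5-tw")      # bytes -> str
as_utf8 = text.encode("utf-8")    # str -> bytes

# Convert back to the original encoding to confirm nothing was lost.
round_trip = as_utf8.decode("utf-8").encode("big5-tw")

print(utf8_ok, type(text).__name__, type(as_utf8).__name__, round_trip == raw)
```

A `str` has no encoding of its own; only the `bytes` on either side of it do, which is why the conversion always passes through `decode` first and `encode` second.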

# 2) Reading in files with encoding problems

Use the code cell below to read in the file at path `"../input/fatal-police-shootings-in-the-us/PoliceKillingsUS.csv"`.

Figure out what the correct encoding should be and read in the file to a DataFrame `police_killings`.

In [None]:
with open("../input/fatal-police-shootings-in-the-us/PoliceKillingsUS.csv", 'rb') as rawdata:
    result = charset_normalizer.detect(rawdata.read(100000))

print(result)
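
Automatic detection only returns a guess, so it can help to test a few candidate encodings directly when a guess does not load. Below is a minimal standard-library sketch of that idea; it is not how `charset_normalizer` works internally. The helper `first_decodable` and its default candidate list are hypothetical, chosen only for illustration.

```python
def first_decodable(raw, candidates=("utf-8", "cp1252")):
    """Return the first candidate encoding that decodes raw without error, else None."""
    for name in candidates:
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None

# Bytes that are valid Windows-1252 but not valid UTF-8.
sample = "café".encode("cp1252")
guess = first_decodable(sample)
print(guess)
```

A successful decode only shows that the bytes are legal in that codec, not that the codec is the right one, so it is still worth inspecting a few rows of the loaded data.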

In [None]:
# TODO: Load in the DataFrame correctly.
police_killings = pd.read_csv("../input/fatal-police-shootings-in-the-us/PoliceKillingsUS.csv", encoding="Windows-1252")

# Check your answer
q2.check()

Feel free to use any additional code cells for supplemental work.  To get credit for finishing this question, you'll need to run `q2.check()` and get a result of **Correct**.

In [None]:
# (Optional) Use this code cell for any additional work.

In [None]:
# Lines below will give you a hint or solution code
# q2.hint()
# q2.solution()

# 3) Saving your files with UTF-8 encoding

Save a version of the police killings dataset to CSV with UTF-8 encoding.  Your answer will be marked correct after saving this file.  

Note: When using the `to_csv()` method, supply only the name of the file (e.g., `"my_file.csv"`).  This saves the file at the filepath `"/kaggle/working/my_file.csv"`.

In [None]:
# TODO: Save the police killings dataset to CSV
police_killings.to_csv("police_killings-utf8.csv")

# Check your answer
q3.check()

In [None]:
# Lines below will give you a hint or solution code
#q3.hint()
# q3.solution()
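
`to_csv()` writes UTF-8 by default, but it can be reassuring to confirm that a saved file really decodes as UTF-8. The sketch below does this with the standard library only, using two small temporary files instead of the exercise dataset. The helper `is_valid_utf8` and the file names are assumptions made for the example.

```python
import os
import tempfile

def is_valid_utf8(path):
    """Return True if every byte of the file at path decodes as UTF-8."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.csv")
    bad = os.path.join(tmp, "bad.csv")
    # Same text written with two different encodings.
    with open(good, "w", encoding="utf-8") as f:
        f.write("name\nJosé\n")
    with open(bad, "w", encoding="cp1252") as f:
        f.write("name\nJosé\n")
    good_ok = is_valid_utf8(good)
    bad_ok = is_valid_utf8(bad)

print(good_ok, bad_ok)
```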

# (Optional) More practice

Check out [this dataset of files in different character encodings](https://www.kaggle.com/rtatman/character-encoding-examples). Can you read in all the files with their original encodings and then save them out as UTF-8 files?
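
One possible shape for that conversion, assuming the source encoding of each file is already known, is sketched below with the standard library. The function `transcode_to_utf8`, the file names, and the Shift JIS sample text are all made up for the demonstration and do not come from the linked dataset.

```python
import os
import tempfile

def transcode_to_utf8(src_path, dst_path, source_encoding):
    """Read src_path using source_encoding and write the same text to dst_path as UTF-8."""
    # newline="" keeps line endings exactly as they are in the source file.
    with open(src_path, "r", encoding=source_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "example_shiftjis.txt")
    dst = os.path.join(tmp, "example_utf8.txt")
    original_text = "こんにちは\n"
    # Create a small Shift JIS file to convert.
    with open(src, "wb") as f:
        f.write(original_text.encode("shift_jis"))
    transcode_to_utf8(src, dst, "shift_jis")
    with open(dst, "rb") as f:
        converted_text = f.read().decode("utf-8")

print(converted_text == original_text)
```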

If you have a file that's in UTF-8 but has just a couple of weird-looking characters in it, you can try out the [ftfy module](https://ftfy.readthedocs.io/en/latest/#) and see if it helps.
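
Weird-looking characters of this kind often appear when UTF-8 bytes are decoded with a different single-byte codec. The sketch below reproduces one such case by hand and then reverses it. It uses only built-in codecs, not ftfy, and the reversal works here only because the wrong codec is known in advance.

```python
original = "café"

# Encode as UTF-8, then wrongly decode the bytes as Windows-1252.
garbled = original.encode("utf-8").decode("cp1252")

# Undo the mistake by reversing the two steps.
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled)   # cafÃ©
print(repaired)  # café
```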

# Keep going

In the final lesson, learn how to [**clean up inconsistent text entries**](https://www.kaggle.com/alexisbcook/inconsistent-data-entry) in your dataset.

---




*Have questions or comments? Visit the [course discussion forum](https://www.kaggle.com/learn/data-cleaning/discussion) to chat with other learners.*