<center>
    <h1 id='character-encoding' style='color:#7159c1'>🔨 Character Encoding 🔨</h1>
    <i>Dealing with Different Charsets</i>
</center>

<br />

When you read a CSV file that isn't encoded in `UTF-8` without telling pandas its charset, you'll likely get an error like this one:

> **UnicodeDecodeError** - `'utf-8' codec can't decode byte 0x99 in position 7955: invalid start byte`.
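
The error can be reproduced without any file. This is a minimal sketch using only Python's built-in codecs; the sample text is made up for illustration, and `cp1252` is Python's name for Windows-1252:

```python
# "Kickstarter™" written as Windows-1252: the "™" sign becomes the single byte 0x99.
raw_bytes = "Kickstarter™".encode("cp1252")

try:
    raw_bytes.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as error:
    # 0x99 is not a valid first byte of a UTF-8 sequence, so decoding fails.
    decode_error = error
    print(error)

# Decoding with the charset the bytes were written in works.
decoded_text = raw_bytes.decode("cp1252")
print(decoded_text)
```

The bytes themselves are fine; they only fail when decoded with a charset other than the one used to write them.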

To solve this, convert the file to UTF-8 by following the steps below:

1 - find out the file's charset;

2 - read the file with the correct charset;

3 - save the file with pandas (UTF-8 is pandas' default charset).
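
The three steps can be sketched as one helper. This is only an illustration, not part of the notebook: the function name `convert_csv_to_utf8` and its parameters are assumptions, and `chardet` only returns a guess, so the charset can also be passed in by hand when it's already known:

```python
import chardet  # pip install chardet
import pandas as pd  # pip install pandas


def convert_csv_to_utf8(source_path, target_path, encoding=None, sample_size=100_000):
    """Re-save a CSV file as UTF-8 and return the charset used to read it."""
    if encoding is None:
        # Step 1: guess the charset from the first `sample_size` bytes.
        with open(source_path, "rb") as file:
            guess = chardet.detect(file.read(sample_size))
        encoding = guess["encoding"]
        if encoding is None:
            raise ValueError(f"could not guess the charset of {source_path}")

    # Step 2: read the file with the guessed (or given) charset.
    dataframe = pd.read_csv(source_path, encoding=encoding)

    # Step 3: save it again; UTF-8 is spelled out even though it is the default.
    dataframe.to_csv(target_path, index=False, encoding="utf-8")
    return encoding
```

Returning the charset makes it easy to check what was guessed before trusting the converted file.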

In [1]:
# ---- Settings ----
import pandas as pd # pip install pandas
import chardet # pip install chardet

In [2]:
# ---- Guessing Charset ----
#
# - reading 10,000 bytes, the charset is probably UTF-8,
# with about 75% confidence
#
with open('./datasets/ks-projects-201801-utf8.csv', 'rb') as file:
    guessed_charset = chardet.detect(file.read(10000))
print(guessed_charset)

{'encoding': 'utf-8', 'confidence': 0.7525, 'language': ''}


In [4]:
# ---- Guessing Charset ----
#
# - the more bytes we read, the higher the confidence tends to be
#
with open('./datasets/ks-projects-201801-utf8.csv', 'rb') as file:
    guessed_charset = chardet.detect(file.read(100000))
print(guessed_charset)

{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
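
Instead of picking a byte count by hand, the file can be fed to chardet piece by piece with its `UniversalDetector` class. This is a hedged sketch: the helper name `guess_charset_incrementally` and the `chunk_size` default are assumptions, and the result is still a guess, not a guarantee:

```python
from chardet.universaldetector import UniversalDetector  # pip install chardet


def guess_charset_incrementally(path, chunk_size=10_000):
    """Feed the file to chardet chunk by chunk and return its final guess."""
    detector = UniversalDetector()
    with open(path, "rb") as file:
        # Read fixed-size chunks until the file ends or chardet is already sure.
        while chunk := file.read(chunk_size):
            detector.feed(chunk)
            if detector.done:
                break
    detector.close()  # finalizes the guess from everything fed so far
    return detector.result  # dict with the keys encoding, confidence and language
```

For large files this avoids loading everything into memory while still letting chardet see as many bytes as it needs.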


In [6]:
# ---- Reading Dataset ----
ks_df = pd.read_csv('./datasets/ks-projects-201801-utf8.csv', encoding='utf-8', low_memory=False)

# ---- Saving the File as Windows-1252 ----
ks_df.to_csv('./datasets/ks-projects-201801-windows1252.csv', encoding='Windows-1252')
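
Saving as Windows-1252 can fail with a `UnicodeEncodeError` when the data holds characters that charset can't represent. A minimal sketch of one way around it, assuming pandas 1.1 or newer (where `to_csv` accepts the `errors` argument); the sample data and file name are made up:

```python
import os
import tempfile

import pandas as pd  # pip install pandas (the `errors` argument needs pandas >= 1.1)

# Hypothetical sample: "日本語" has no Windows-1252 representation.
sample_df = pd.DataFrame({"name": ["Café™", "日本語"]})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample-windows1252.csv")

    # errors="replace" writes "?" for each character the charset cannot encode,
    # instead of raising UnicodeEncodeError (the default, errors="strict").
    sample_df.to_csv(path, index=False, encoding="Windows-1252", errors="replace")

    # Read the file back with the same charset it was written in.
    round_trip = pd.read_csv(path, encoding="Windows-1252")

print(round_trip["name"].tolist())
```

Replacing characters loses information, so this only suits cases where a lossy copy is acceptable.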

---

<h1 id='reach-me' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📫 | Reach Me</h1>

> **Email** - [csfelix08@gmail.com](mailto:csfelix08@gmail.com?)

> **LinkedIn** - [linkedin.com/in/csfelix/](https://www.linkedin.com/in/csfelix/)

> **GitHub** - [CSFelix](https://github.com/CSFelix)

> **Kaggle** - [DSFelix](https://www.kaggle.com/dsfelix)

> **Portfolio** - [CSFelix.io](https://csfelix.github.io/).