In [1]:
import pandas as pd

## 01 - Encodings...

### Quick backstory

Nobody in the history of anything ever has enjoyed dealing with encodings...

But it is something that must be done! You can see here a list of
[standard encodings](https://docs.python.org/3/library/codecs.html#standard-encodings) that
Python supports.

So what is an encoding exactly? To answer as simply as possible: it is the set of rules by which you interpret the sequence of 0s and 1s that a file is stored as. Back in the day when digital computers were first becoming
a thing, it was mostly English speakers using them, and since the English alphabet is pretty simple, you only
needed to represent [128 different "ASCII" characters](http://www.asciitable.com/) to do your work. Well, it turns out
that there are other languages than English, so people started coming up with different encodings for how
the 0s and 1s should be stored and interpreted to represent various characters.

Chaos ensued and still does.

Different operating systems, programs, websites, people, etc. used different encodings, many times for
the same characters. That means that you could store the same character in two different ways even
though they have the same semantic meaning. Bummer sauce.
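
To make that concrete, here is a minimal sketch in plain Python (no files involved) that stores the same character under a few of the standard encodings. The character and the list of encodings are picked only for illustration.

```python
# The same character becomes a different sequence of bytes
# depending on which encoding is used to store it.
char = 'é'
encodings = ['utf-8', 'latin-1', 'cp1252', 'utf-16-le']

encoded = {name: char.encode(name) for name in encodings}
for name, raw in encoded.items():
    print(name, raw.hex())

# ASCII has no way to represent this character at all
try:
    char.encode('ascii')
except UnicodeEncodeError as err:
    print('ascii:', err.reason)
```

Same character, same meaning, but the bytes on disk differ, and ASCII simply refuses.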

### The consequences

The single most important consequence that you must know about is: *you must know the encoding of a file before you read it*. If you don't know the encoding of the file before you read it, you never know what is going to happen. It
might throw an error, it might work but show you some garbled message, you just don't know.

Great, so just ask the person who gave you the file what encoding the file is in. Easy right? Sure thing, if they have ever heard of the word. It's quite common for someone to send you a file and not know how it is encoded.

### Where does that leave us?

It leaves us with a guessing game. There are a few rules you can stick by and try to keep in your head though:

1. Always use utf-8 if you can. Most programs can read it by default.
1. If people are sending you a file from a Windows machine, there's a good chance it will be cp1252 (Windows-1252) or its close cousin ISO-8859-1.
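
These rules can be turned into a small guessing helper. The sketch below is only one way to do it: `decode_with_fallback` is a hypothetical name, and the candidate list (utf-8 first, then the Windows code page cp1252, then latin-1, which is Python's name for ISO-8859-1) is my assumption about what is worth trying.

```python
def decode_with_fallback(raw, candidates=('utf-8', 'cp1252', 'latin-1')):
    """Return (text, encoding) for the first candidate that decodes without error."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings worked')

# The same text as it might arrive from two different machines
utf8_bytes = 'ã,é,a'.encode('utf-8')
windows_bytes = 'ã,é,a'.encode('cp1252')

print(decode_with_fallback(utf8_bytes))
print(decode_with_fallback(windows_bytes))
```

The order matters: latin-1 accepts any byte sequence, so it has to go last. Also keep in mind that decoding without an error does not prove the guess was right. Once you have settled on a guess, you can pass it to the `encoding` parameter of `pd.read_csv`.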

### Let's see what happens

When you read the same dataset with different encodings.

In [16]:
# First let's take a look at non-ascii characters
pd.read_csv('hello.csv', encoding='utf-8') # utf-8 is the default BTW

Unnamed: 0,hello
0,ã
1,é
2,a


In [17]:
# now let's see what happens when we try to read it
# with the wrong encoding specified
pd.read_csv('hello.csv', encoding='ascii')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

### What does all of that mean?

Don't worry about it for now, it's a big subject that I'm not going to pretend to be an expert
on. The main thing you need to know is: you specified the wrong encoding, and Python
throws a `UnicodeDecodeError` when this happens.
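
If you are curious anyway, the error can be reproduced without pandas. This sketch assumes the offending byte comes from the ã shown in the output above; the variable names are just for illustration.

```python
# What a utf-8 file stores for this one character: two bytes
raw = 'ã'.encode('utf-8')
print(raw)

try:
    raw.decode('ascii')
except UnicodeDecodeError as err:
    # err.start is the offset of the first byte that could not be decoded
    bad_byte = raw[err.start]
    print(f'{err.encoding} failed on byte {hex(bad_byte)} at position {err.start}')
```

ASCII only defines values below 128, so the first byte of any multi-byte utf-8 character is enough to make the decode fail.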

In [15]:
# now let's take a look at another much more scary case
# in which the read doesn't throw any errors but gives
# you the wrong character representations
pd.read_csv('hello.csv', encoding='cp1254')

Unnamed: 0,hello
0,Ã£
1,Ã©
2,a


### This is particularly scary

What can happen is that in a very large CSV that is 90% correct,
you might not notice any problems on a call to `head()` or any other summarizing
method. Then later on down the road, when you are preparing a presentation,
some gibberish turns up that you didn't notice before.
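
Since `head()` only shows the first few rows, one option is to scan every value for the typical footprint of this kind of damage. The sketch below is a rough heuristic, not a reliable detector: `suspicious_rows` and the pattern are my own illustration, and the pattern only covers utf-8 text that was misread with a single-byte Western encoding.

```python
import re

# "Ã" or "Â" followed by a character in the U+0080 to U+00BF range is what
# two-byte utf-8 sequences often look like after a wrong single-byte decode.
MOJIBAKE = re.compile('[ÃÂ][\u0080-\u00bf]')

def suspicious_rows(values):
    """Return (row number, value) pairs that look garbled."""
    return [(i, v) for i, v in enumerate(values) if MOJIBAKE.search(v)]

# Mostly fine data with the damage hiding at the very end
column = ['a'] * 1000 + ['Ã£', 'Ã©']
print(suspicious_rows(column))
```

With a DataFrame, the same pattern could be passed to `Series.str.contains` to flag rows in a text column.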

The solution is the same as always: do your best to find out the encoding from the
person that created the file!

If you can't do this, there's another route you can take and try to use
tools to guess the encoding. However these methods are not 100% reliable.

One tool I've used with varying success is [chardet](https://chardet.readthedocs.io/en/latest/).
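
A minimal sketch of how a guess with chardet might look. `chardet.detect` takes raw bytes and returns a dictionary that includes an `encoding` and a `confidence`; the `guess_encoding` wrapper and the sample text are hypothetical, and the import is guarded because chardet is a third-party package that may not be installed.

```python
try:
    import chardet  # third-party: pip install chardet
except ImportError:
    chardet = None

def guess_encoding(raw):
    """Return chardet's best guess as (encoding, confidence), or (None, 0.0)."""
    if chardet is None:
        return None, 0.0
    result = chardet.detect(raw)
    return result['encoding'], result['confidence']

sample = 'Olá, tudo bem? Não há problema, é só um exemplo.'.encode('utf-8')
encoding, confidence = guess_encoding(sample)
print(encoding, confidence)
```

For a real file you would read the bytes with `open(path, 'rb')` and feed them to the same function. Short samples give the detector little to work with, so treat a low confidence as "still don't know".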

## 02 - Reading excel files

You may be given an Excel file, or need to report back to someone
with an Excel file. There's really not much to know here: just be aware that you can read
and write Excel files using pandas.

In [22]:
pd.read_excel('squirrel.xlsx')

Unnamed: 0,Hello,world
0,1,2
1,3,4
2,5,6
3,7,8
4,9,10


In [26]:
pd.DataFrame({'hello': [1, 2, 3]}).to_excel('my.xlsx')

In [27]:
! head my.xlsx

PK... (binary output truncated: an .xlsx file is a ZIP archive, and the only readable fragments are the names of the XML parts inside it, such as _rels/.rels, docProps/app.xml, docProps/core.xml and xl/theme/theme1.xml)

Hooray! Now you can send a utf-8 encoded Excel file to the biz dev department!

## CSV, TSV, and other separator woes

There are lots of ways to separate values in a text file. This
is part of the problem with storing data in a format where
values are separated by arbitrary characters: there are no inherent
types and anything goes!

A very common problem that people run across in southern Europe
is how decimals are written. For example, in Portuguese you
use the comma (,) rather than a period (.) to indicate a decimal.

In [32]:
! cat comma-decimal.csv

numeric;categorical
1,2;a
2,2;b
1000,9;c


### What's going on here?

As you can see, this file uses a comma to denote a decimal
and a semicolon (;) to separate columns. That means we must
pass two extra parameters to `read_csv` to handle it.

In [33]:
pd.read_csv('comma-decimal.csv', decimal=',', sep=';')

Unnamed: 0,numeric,categorical
0,1.2,a
1,2.2,b
2,1000.9,c


If you forget to do this, you will get some proper nonsense!

In [35]:
pd.read_csv('comma-decimal.csv')

Unnamed: 0,numeric;categorical
1,2;a
2,2;b
1000,9;c
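
A quick way to catch this kind of nonsense early is to look at the column names and dtypes right after reading. The sketch below rebuilds the same file contents in memory with `io.StringIO` so it can run without `comma-decimal.csv`; the variable names are just for illustration.

```python
import io
import pandas as pd

raw = 'numeric;categorical\n1,2;a\n2,2;b\n1000,9;c\n'

# Default settings: comma as separator, period as decimal
wrong = pd.read_csv(io.StringIO(raw))
# Settings that match how the file was actually written
right = pd.read_csv(io.StringIO(raw), sep=';', decimal=',')

print(list(wrong.columns))   # a single column whose name still contains the ';'
print(right.dtypes)          # the numeric column is parsed as a float
```

A column name with a stray separator in it, or a numeric column that did not come out as a number, is a strong hint that `sep` or `decimal` is wrong.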


## Other random problems

Imagine that a confused someone creates a file in which there's a totally
random row inserted into the middle of it. Your entire csv may be okay
but because of this one pesky little row, you are totally screwed!

In [44]:
pd.read_csv('confused.csv')

ParserError: Error tokenizing data. C error: Expected 3 fields in line 4, saw 5


Let's check out a few lines in the file around the problem line.

In [46]:
! cat confused.csv

hello,world,again
1,2,a
3,4,b
squirrel;blah;what???,,,,
5,6,c
5,6,c
5,6,c
5,6,c
5,6,c
5,6,c


Aha! Looks like a mostly respectable file except for one pesky line in the middle of it.
Now that we can be reasonably assured that it's alright to skip all problematic lines because
we won't be getting rid of a ton of data, let's go ahead and do so.

In [47]:
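# error_bad_lines=False was deprecated in pandas 1.3 and removed in 2.0.
# on_bad_lines='warn' skips bad lines and reports each one it skips.
pd.read_csv('confused.csv', on_bad_lines='warn')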

b'Skipping line 4: expected 3 fields, saw 5\n'


Unnamed: 0,hello,world,again
0,1,2,a
1,3,4,b
2,5,6,c
3,5,6,c
4,5,6,c
5,5,6,c
6,5,6,c
7,5,6,c
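
On a file too big to `cat`, it helps to count the problematic lines before deciding to skip them. Here is a minimal sketch using the standard library `csv` module. `find_bad_lines` is a hypothetical helper, the sample is a shortened copy of the file above, and it assumes no quoted field spans more than one line.

```python
import csv
import io

raw = (
    'hello,world,again\n'
    '1,2,a\n'
    '3,4,b\n'
    'squirrel;blah;what???,,,,\n'
    '5,6,c\n'
    '5,6,c\n'
)

def find_bad_lines(text):
    """Return (line number, field count) for rows that disagree with the header."""
    rows = csv.reader(io.StringIO(text))
    expected = len(next(rows))  # the header sets the expected field count
    # Data rows start on line 2 because the header is line 1
    return [(n, len(row)) for n, row in enumerate(rows, start=2)
            if len(row) != expected]

bad = find_bad_lines(raw)
print(bad)
```

If the list is short compared with the size of the file, skipping is probably fine. If it is long, the separator or the quoting is more likely the real problem.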
