## Libraries

In [None]:
import pandas as pd
import glob

---

## ANSI Encoding Error

Some DATASUS files use an encoding other than UTF-8. This function reads those files with a fallback encoding and merges the files from the same year into one.

In [None]:
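def ansiFix(path, pattern, outputPath):
    files = glob.glob(f'{path}/{pattern}')

    frames = []

    for item in files:
        try:
            # pandas reads CSV files as UTF-8 by default
            df = pd.read_csv(item)
        except UnicodeDecodeError:
            # 'ANSI' is only a valid codec name on Windows; cp1252 is the
            # usual Windows "ANSI" code page (an assumption about these files)
            df = pd.read_csv(item, encoding='cp1252')

        frames.append(df)

    # Nothing matched the pattern, so there is nothing to write
    if not frames:
        return

    # Merge every matched file and drop repeated rows
    aux = pd.concat(frames, axis=0).drop_duplicates()

    aux.to_csv(outputPath, index=False)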

Example use:

In [None]:
path = 'input'
for i in range(12, 23):
    pattern = f'*20{i}.csv'
    outputPath = f'output/DO20{i}.csv'

    ansiFix(path, pattern, outputPath)
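The fallback can be checked on made-up data before running it on the real files. This is only a sketch: `read_with_fallback`, the file names and the city values are hypothetical, and `cp1252` is an assumption about what "ANSI" means for these files.

```python
import os
import tempfile

import pandas as pd


def read_with_fallback(csv_path, fallback='cp1252'):
    """Try UTF-8 first, then retry with the fallback encoding."""
    try:
        return pd.read_csv(csv_path)
    except UnicodeDecodeError:
        return pd.read_csv(csv_path, encoding=fallback)


# Two small made-up files, one per encoding
tmp_dir = tempfile.mkdtemp()
utf8_file = os.path.join(tmp_dir, 'a2012.csv')
ansi_file = os.path.join(tmp_dir, 'b2012.csv')

with open(utf8_file, 'w', encoding='utf-8') as f:
    f.write('id,city\n1,São Paulo\n')
with open(ansi_file, 'w', encoding='cp1252') as f:
    f.write('id,city\n2,Belém\n')

# Read both files and merge them, as ansiFix does for one year
frames = [read_with_fallback(p) for p in (utf8_file, ansi_file)]
merged = pd.concat(frames, ignore_index=True).drop_duplicates()
print(merged)
```

If the accented names come out garbled, the fallback encoding is probably not the right one for that file.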

---

## Double Header Error

Some files had a second problem: their first N lines, header included, were copied again further down in the same file. I'm calling this the Double Header Error. This function should be able to fix it.

> Warning: The function won't remove the copied lines if their values are written differently from the originals, e.g. '1' and '1.0', or '1' and '"1"'.
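To see what the warning means, here is a minimal sketch on a made-up file (the column names and values are hypothetical). The header and the first data row appear twice, but the copy writes the number as `1.0`.

```python
import io

import pandas as pd

# Made-up content: header + first row repeated, with 1 rewritten as 1.0
raw = 'id,name\n1,ana\n2,bia\nid,name\n1.0,ana\n'

df = pd.read_csv(io.StringIO(raw))
print(df.dtypes)  # the repeated header forces the id column to be read as text

# Drop the rows that repeat the header
for coluna in df.columns:
    df = df[df[coluna] != coluna]

df = df.drop_duplicates()
print(df)  # '1' and '1.0' are different strings, so the copied row is still there
```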

In [None]:
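import os

def dhError(path, pattern, outputPath):
    files = glob.glob(f'{path}/{pattern}')

    for item in files:
        df = pd.read_csv(item)

        # Drop every row whose value in a column is that column's own name,
        # i.e. the header repeated inside the data
        for coluna in df.columns:
            df = df[df[coluna] != coluna]

        df = df.drop_duplicates()

        # outputPath is a folder: each file keeps its own name, so the
        # results don't overwrite each other
        df.to_csv(os.path.join(outputPath, os.path.basename(item)), index=False)

> Note about the Warning: in the actual files, the copied lines were written differently from the originals, so this function was only able to remove the repeated header, not the copied rows.

Example use:

In [None]:
path = 'input'
pattern = '*.csv'
outputPath = 'output'  # existing folder that receives the fixed files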

dhError(path, pattern, outputPath)
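One possible way to also catch copies like '1' and '1.0' is to convert the columns back to numbers before dropping duplicates. This is a sketch on made-up data, not something I ran on the DATASUS files, and `normalize_numeric` is a hypothetical helper.

```python
import io

import pandas as pd


def normalize_numeric(df):
    """Convert a column to numbers only if every non-empty value parses."""
    out = df.copy()
    for col in out.columns:
        converted = pd.to_numeric(out[col], errors='coerce')
        # Accept the conversion only if it created no new missing values
        if converted.isna().sum() == out[col].isna().sum():
            out[col] = converted
    return out


# Made-up content: header + first row repeated, with 1 rewritten as 1.0
raw = 'id,name\n1,ana\n2,bia\nid,name\n1.0,ana\n'
df = pd.read_csv(io.StringIO(raw))

# Drop the rows that repeat the header
for coluna in df.columns:
    df = df[df[coluna] != coluna]

# '1' and '1.0' both become 1.0, so the copied row is now a duplicate
df = normalize_numeric(df).drop_duplicates()
print(df)
```

This doesn't handle the '"1"' case, and codes with leading zeros such as '01' would lose them, so columns like that should be left out of the conversion.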