In [30]:
import os, codecs
import pandas as pd
from pandas.api.types import CategoricalDtype

In [31]:
data_folder = '../data/'
df = pd.read_csv(os.path.join(data_folder, "El_Maestro_for_column_splitting.csv"), sep=";")

In [40]:
df.head()
# I am still very much in the data-input phase with this dataset.
# For this example, I have kept only the first 64 observations and removed several columns that are not of interest here and are not yet completely filled in.
# I also need to review the column names, and will cast the data types in due course (dates and page numbers, for instance).

Unnamed: 0,Num_absoluto,Tomo,Num,dia,mes,ano,autor,idioma_original,traductor,pag_ini,pag_fin,extension,largo_num,signature_type
0,1,I,1,1.0,4,1921,Romain Rolland,francés,anon,11.0,12.0,2.0,96,anonymous unidentified
1,1,I,1,1.0,4,1921,Leon Tolstoi,ruso,anon,41.0,53.0,13.0,96,anonymous unidentified
2,1,I,1,1.0,4,1921,Bernard Shaw,inglés,anon,54.0,58.0,5.0,96,anonymous unidentified
3,1,I,1,1.0,4,1921,Esopo,,anon,94.0,94.0,1.0,96,anonymous unidentified
4,2,I,2,1.0,5,1921,William Swinton,inglés,anon,21.0,26.0,6.0,112,anonymous unidentified


### Tidying the "traductor" column
The column titled "traductor" holds two kinds of information: the translator's name, when available, and what I call the signature type. The signature type records whether a text names its translator or, failing that, at least mentions that it is a translation (usually "Translated especially for...", or some reference to the translation process in an introductory note). I have therefore defined three categories: "signed", "anonymous identified", and "anonymous unidentified". The last applies to texts that bear no mention or hint whatsoever that they are translations, which is a rather common occurrence in my corpus. Of course, the status of these texts as translations can be deduced from the original author's name, but this is not always straightforward, and I am interested in recording the magazine's ways of making the translation process itself visible (and possibly showcasing it).
Below is the solution that I have devised to split this column in two.

In [33]:
df.shape

(64, 13)

In [34]:
df.dtypes

Num_absoluto         int64
Tomo                object
Num                 object
dia                float64
mes                  int64
ano                  int64
autor               object
idioma_original     object
traductor           object
pag_ini            float64
pag_fin            float64
extension          float64
largo_num            int64
dtype: object
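
The float64 dtype of pag_ini, pag_fin and extension in the listing above is presumably a side effect of the rows that are not filled in yet, since pandas stores integer columns with gaps as floats. When the time comes to cast them, one option is the nullable "Int64" dtype, which keeps whole numbers and missing values side by side. A minimal sketch on a made-up frame (the values are illustrative, not taken from the corpus):

```python
import pandas as pd

# Toy page columns with one missing row, read in as float64
toy = pd.DataFrame({
    "pag_ini": [11.0, 41.0, None],
    "pag_fin": [12.0, 53.0, None],
})

# Cast to the nullable integer dtype; missing entries become <NA>
toy = toy.astype({"pag_ini": "Int64", "pag_fin": "Int64"})

print(toy.dtypes)
print(toy)
```

Note the capital "I" in "Int64": the lowercase "int64" cannot hold missing values and would fail on the empty row.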

In [35]:
df["signature_type"] = df["traductor"]  # create the new signature_type column as a copy of traductor

In [36]:
# Recode the two Spanish descriptions into the English category labels.
# Assigning the result back avoids replace(inplace=True) on a column accessor, which may not update df in recent pandas versions.
df["signature_type"] = df["signature_type"].replace({
    "anónimo, pero sí se menciona que se trata de una traducción": "anonymous identified",
    "anónimo, no se menciona que se trata de una traducción": "anonymous unidentified",
})

cat_type = CategoricalDtype(
    categories=["anonymous identified", "anonymous unidentified", "signed"],
    ordered=True
)

# signature_type takes one of three categories: anonymous identified, anonymous unidentified, signed
df["signature_type"] = df["signature_type"].astype(cat_type)

In [37]:
# Translator names are not among the categories, so the cast left them as missing values; those rows are the "signed" ones
df["signature_type"] = df["signature_type"].fillna("signed")
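
The cast-then-fill approach works because, in the pandas versions I am aware of, casting to a CategoricalDtype turns any value not listed in its categories into a missing value. For readers who would rather not rely on that implicit behaviour, the same result can be reached explicitly: keep the two anonymous labels and mark everything else as "signed" before casting. A minimal sketch on a made-up Series ("Translator A" and "Translator B" are placeholder names, and `labels` is a helper introduced only for this illustration):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

labels = ["anonymous identified", "anonymous unidentified"]
cat_type = CategoricalDtype(categories=labels + ["signed"], ordered=True)

# Toy column after recoding: two labels and two hypothetical translator names
toy = pd.Series([
    "anonymous identified",
    "Translator A",
    "anonymous unidentified",
    "Translator B",
])

# Keep the anonymous labels; any other value (a name) is marked "signed"
signature = toy.where(toy.isin(labels), "signed").astype(cat_type)

print(signature.tolist())
```

Either route leaves the translators' names untouched in the original column, which is what allows the two columns to be separated afterwards.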

In [38]:
# In the translator column itself, both anonymous descriptions collapse to "anon"
df["traductor"] = df["traductor"].replace({
    "anónimo, pero sí se menciona que se trata de una traducción": "anon",
    "anónimo, no se menciona que se trata de una traducción": "anon",
})

In [39]:
df[["traductor", "signature_type"]]

Unnamed: 0,traductor,signature_type
0,anon,anonymous unidentified
1,anon,anonymous unidentified
2,anon,anonymous unidentified
3,anon,anonymous unidentified
4,anon,anonymous unidentified
5,anon,anonymous unidentified
6,anon,anonymous unidentified
7,anon,anonymous unidentified
8,anon,anonymous unidentified
9,anon,anonymous unidentified
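
Since only the first rows are displayed above, a quick consistency check may be worthwhile: every row whose translator is "anon" should carry one of the two anonymous signature types, and every named translator should be "signed". A sketch of such a check on a toy frame ("Translator A" is a placeholder name):

```python
import pandas as pd

toy = pd.DataFrame({
    "traductor": ["anon", "anon", "Translator A"],
    "signature_type": ["anonymous identified", "anonymous unidentified", "signed"],
})

is_anon = toy["traductor"].eq("anon")
is_signed = toy["signature_type"].eq("signed")

# A row is consistent when exactly one of the two flags is set,
# so rows where the flags are equal need a second look
inconsistent = toy[is_anon == is_signed]
print(len(inconsistent))

# The cross-tabulation shows which combinations actually occur
print(pd.crosstab(toy["traductor"], toy["signature_type"]))
```

On the real data, the same two lines defining `is_anon` and `is_signed` could be run against df directly.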


### Mission accomplished
I will now need to work on some translator name disambiguation.