# Making a simple Akinator clone

This project started as an interview test for a position as a Java programmer. After quickly running and decompiling the .jar file given as an example, I thought it would be a good exercise for Python, since the end result is a program well suited to be a first pandas dataframe example.

Having little time to work on it and wanting to practice my pandas skills, I just pivoted it to a Python project\
(and yes, I intend to port it to Java later for practice, but tbh I'm focusing more on data analysis, so Python it is, for now)

The first thing to do before getting to the Python code is to understand the scope of the project. The jar file sent was a simple Akinator clone for animals, and decompiling it showed that the logic itself was simple.

The example code dissected in this first part was not written by me; it was sent by e-mail to exemplify what the project was. It can be found in the example_project folder of the JupyterLab.

---
# Dissecting the example code given
---

The example project contains 5 classes:

- Logica (Logic)
- Mensagem (Message)
- Funcoes (Functions)
- AnimalDAO (AnimalDataAccessObject)
- Animal

The Logic class has the main game loop and instantiates the other classes.

The Message class has the UI components.

The Functions class has 3 functions used by the Logic class.

The AnimalDAO is a class used to access and hold the List of Animal objects used in the game.

And the central piece of this code, the Animal class:

## The animal class:

#### The class variables:

Every animal registered was an object of a simple Animal class that has:


    public class Animal
    {
       private Integer id;
       private String caracteristica;
       private String respostaSim;
       private String respostaNao;
       private Integer idPai;
       private Integer filhoDaResposta;
    ...

- id (int), which stores the index in the list that contains all Animal objects,
- characteristic (caracteristica) (string), which is the basis of the yes-or-no question presented to the user for this object,
- 2 animals as answers to the characteristic question (respostaSim, respostaNao) (string), one for each response,
- idFather (idPai) (int), which holds the index of this object's parent in a strange binary search tree (the implementation is explained later),
- childOfAnswer (filhoDaResposta) (int), which holds the path to get to this object from its parent (0/1, left/right, yes/no)

#### Understanding how the class works:

At first glance this looks like a node of a binary search tree, and it behaves like one when you run the game, but the implementation takes a turn: instead of referencing its children, the only reference a node holds is to its parent, plus the decision taken from there to reach it.\
So accessing a child directly is impossible. The only way is to search the List held in AnimalDAO, passing the index of the parent node and the path (0/1) that leads to that child. If the node is a leaf, you must search every single element to finally learn that.\
When the active node is a leaf on the path chosen by the player, the animal saved in respostaSim or respostaNao (1/0 paths) is presented to the player as their animal. If the player confirms, the game ends; if they deny, they get a chance to add their chosen animal and a question that defines it, growing the tree.\
The new node is created as a child of the last node: its 'respostaXXX' variable on the path chosen is set to the new animal, and the variable on the opposite path is set to the animal that was denied.
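
To make the parent-pointer layout concrete, here is a minimal Python sketch of it. It is not the decompiled Java code: the `Node` dataclass, `find_child`, `play` and `add_animal` are hypothetical names, and I am assuming the new question is phrased so that "yes" leads to the new animal.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Python stand-in for the Java Animal class (field names translated)."""
    id: int
    characteristic: str
    answer_yes: str                  # respostaSim
    answer_no: str                   # respostaNao
    parent_id: Optional[int]         # idPai (None for the root)
    child_of_answer: Optional[int]   # filhoDaResposta: 1 = yes, 0 = no


def find_child(nodes: List[Node], parent_id: int, answer: int) -> Optional[Node]:
    """Linear scan of the whole list: nodes store no links to their children."""
    for node in nodes:
        if node.parent_id == parent_id and node.child_of_answer == answer:
            return node
    return None


def play(nodes: List[Node], answers: List[int]) -> str:
    """Follow the yes/no answers from the root and return the guessed animal."""
    current = nodes[0]
    for answer in answers:
        child = find_child(nodes, current.id, answer)
        if child is None:
            # No child on this path, so the node is a leaf here: make the guess.
            return current.answer_yes if answer == 1 else current.answer_no
        current = child
    raise ValueError("ran out of answers before reaching a leaf")


def add_animal(nodes: List[Node], parent: Node, answer: int,
               new_question: str, new_animal: str) -> Node:
    """Grow the tree after a wrong guess on the `answer` path of `parent`.

    Assumption: 'yes' to the new question means the new animal,
    'no' means the animal that was just denied.
    """
    denied = parent.answer_yes if answer == 1 else parent.answer_no
    node = Node(len(nodes), new_question, new_animal, denied, parent.id, answer)
    nodes.append(node)
    return node
```

Every step down the tree costs a full pass over the list, which is the price of storing only the parent index.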

After understanding this unorthodox search system for a simple binary tree the rest of the code was easy to follow.

---
# The example code works. But how can it be better?
---

When playing and dissecting the example I noticed these 2 shortcomings:

1. The game does not try to answer with the lowest number of questions
2. The game has no persistent memory

So for my implementation I set the following user requirements:

- The game should be optimized so it tries to get the chosen animal in the least possible number of questions
- The game has only one shot, so it only guesses when there is enough information for a confident guess
- The game needs a persistent memory, so the game state needs to be saved before the game quits

And for myself the following restrictions were added:\
(as I'm doing this project as an exercise)

- The code must accept new UI and Data managers:\
but the first UI module needs to be a command-line one (to be played here in this Jupyter notebook)\
and the first Data module must use CSV files (to be used for a future data analysis and ML project I have in mind)

With these rules we can begin showing the project, but first...

---
# Playing the game yourself
---

Below is a code block that will run the game if you are accessing this notebook in an interactive viewer.

If you are new to Jupyter notebooks:\
to run the block just select it (with the arrow keys on the keyboard, or by clicking with the mouse) and press Ctrl+Enter.

Giving q as any input will end the game.

In [1]:
%run ./animal_game/animal_game

------------------------
Think of an animal, then judge the facts about it with yes or no:
------------------------
Your animal is a mammal. (y/n)


 y


------------------------
Your animal survives outside of water. (y/n)


 n


------------------------
Your animal has stripes. (y/n)


 n


------------------------
Is the animal a whale? (y/n)


 y


---
# The game code
---

The game code can be found in the animal_game folder of this project's [github](https://github.com/TMKlautau/animal_game_development) or in the JupyterLab hosted on [binder](https://mybinder.org/v2/gh/TMKlautau/animal_game_development/master), feel free to use any part of it

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/TMKlautau/animal_game_development/master)

To start, we first need to import the game modules into our Python console. This gives us easy access to the modules and to their docs through the help function.

(remember to run the code in its intended order)

In [2]:
from animal_game import animal_game

The code itself is simple, but first we need to go over the command-line arguments that can be passed and what they do. To do this, why not just look at the help for the module?

In [3]:
help(animal_game)

Help on module animal_game.animal_game in animal_game:

NAME
    animal_game.animal_game

FUNCTIONS
    main(args)
        Script entry point for the animal game
        
        Defined args:
        
        -uw : (update weights) recalculates the weights of the questions and sort the optimal question order for the first question
        -rnd : (randomize questions) randomizes the questions order, used to train questions with low percent of data
                and making the game non-linear after a database is optimized
        -pd : (print data) debug function to print the dataframes on the console, best used for small dataframes, for big ones the best is to acess it directly
        -sg : (skip game) used to perform the actions passed on the other arguments without starting the game after

FILE
    /home/tmk/Desktop/animal_game_development/animal_game/animal_game.py




The entry point for the script just instantiates the Ui, Data and Logic modules, parses the command-line arguments, and then starts the execution using the start_execution function of the logic module (who would have seen this coming?)
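
As a rough illustration of the argument handling, the four flags from the help text could be parsed like this. This is only a sketch: I am assuming `argparse` here, and `parse_flags` is a hypothetical helper, not a function from the repository.

```python
import argparse


def parse_flags(args):
    """Parse the four documented flags into a namespace of booleans."""
    parser = argparse.ArgumentParser(description="animal game (sketch)")
    parser.add_argument("-uw", action="store_true", help="update weights")
    parser.add_argument("-rnd", action="store_true", help="randomize questions")
    parser.add_argument("-pd", action="store_true", help="print data")
    parser.add_argument("-sg", action="store_true", help="skip game")
    return parser.parse_args(args)


# Recalculate the weights, then exit without starting the game.
flags = parse_flags(["-uw", "-sg"])
start_game = not flags.sg
print(flags.uw, flags.rnd, flags.pd, start_game)
```

Each flag is a plain on/off switch, so `store_true` is enough and the entry point only has to check the resulting booleans before calling into the logic module.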

## The UI module:

The ui module contains a simple base class defining the functions that UI modules need to implement, and the only Ui module implemented so far (the command-line one).

Below is the help page for it. I will not go into detail, as it's a simple, self-explanatory module.

In [4]:
help(ui.Ui_module)

Help on class Ui_module in module ui:

class Ui_module(builtins.object)
 |  Base UI class
 |  
 |  Methods defined here:
 |  
 |  __init__(self)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  get_delimited_input_with_text(self, text:str, values:tuple) -> str
 |      Present text to user and waits for a delimited input
 |      
 |      Args:
 |          text (str): the text to be presented
 |          values (tuple): tuple of values accepted as input
 |      
 |      Returns:
 |          str: input from user
 |  
 |  get_free_input_with_text(self, text:str) -> str
 |      Present text to user and receive input
 |      
 |      Args:
 |          text (str): the text to be presented
 |      
 |      Returns:
 |          str: input from user
 |  
 |  get_str_input(self) -> str
 |      Receive input from user on the str format without presenting a text
 |      
 |      Returns:
 |          str: input from user
 |  
 |  present_text(self, text:str)
 |      Pr
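
Since the base class only fixes the interface, a new front end is just a subclass that fills in those four methods. The sketch below is a hypothetical example, not the repository's command-line class: `Ui_module` is re-declared here as a stand-in using only the signatures listed in the help output, and `Console_ui` with its injectable `read`/`write` callables is my own assumption to keep it testable.

```python
from typing import Callable


class Ui_module:
    """Stand-in for the base class: only the method signatures from the help page."""

    def present_text(self, text: str):
        raise NotImplementedError

    def get_str_input(self) -> str:
        raise NotImplementedError

    def get_free_input_with_text(self, text: str) -> str:
        raise NotImplementedError

    def get_delimited_input_with_text(self, text: str, values: tuple) -> str:
        raise NotImplementedError


class Console_ui(Ui_module):
    """Hypothetical command-line UI; read/write can be swapped out for tests."""

    def __init__(self, read: Callable[[], str] = input,
                 write: Callable[[str], None] = print):
        self._read = read
        self._write = write

    def present_text(self, text: str):
        self._write(text)

    def get_str_input(self) -> str:
        return self._read().strip()

    def get_free_input_with_text(self, text: str) -> str:
        self.present_text(text)
        return self.get_str_input()

    def get_delimited_input_with_text(self, text: str, values: tuple) -> str:
        # Keep asking until the answer is one of the accepted values.
        while True:
            answer = self.get_free_input_with_text(text)
            if answer in values:
                return answer
```

With the defaults it talks to the terminal; passing a scripted `read` makes it possible to replay a game without a human at the keyboard.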

## The Data module:

The Data module implements the persistent memory of the game and adds a layer of abstraction over the parts of the game logic that depend on the data storage method, so the logic module can work with any type of storage just by passing that storage's module subclass into its constructor.

And here is the call to its help page:

In [5]:
help(data.Data_module)

Help on class Data_module in module data:

class Data_module(builtins.object)
 |  Base data manipulation module
 |  
 |  Methods defined here:
 |  
 |  __init__(self)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  add_animal_with_dict(self, animal_data:dict)
 |      Adds an animal with the data passed on the dict (name obrigatory)
 |      
 |      Args:
 |          animal_data (dict): data of the animal to add
 |  
 |  add_question_with_text(self, text:str) -> str
 |      Adds a question with the text passed
 |      
 |      Args:
 |          text (str): text of the question to be added
 |  
 |  calculate_questions_weights(self)
 |      Changes the order of the questions using a weight system
 |      (the weight is the percent of animals with data times the difference between the 2 options, so weight correlates with the number of removed options if a question is presented)
 |  
 |  check_only_one_valid_animal_by_question_index(self, question_index:int) 

Like the Ui module, the Data module only has one subclass implemented at the moment, the Csv_data_module, which as the name suggests uses CSV files as the storage method.

The module uses 2 tables, stored in the './dataframes/csv' folder: one holds the animals' information, the other the questions.\
(I still need to implement the binary tree to hold a better question order; when done there will be 3 tables)

To see them, first let's import pandas, then create a dataframe called questions_df and one called animals_df, and print their heads:

In [6]:
import pandas as pd

In [7]:
questions_df = pd.read_csv('./animal_game/dataframes/csv/questions.csv', index_col=0)
animals_df = pd.read_csv('./animal_game/dataframes/csv/animals.csv', index_col=0)

display(questions_df.head())
display(animals_df.head())

|   | id | text | order | weight |
|---|----|------|-------|--------|
| 0 | q2 | is a mammal | 2 | 0.857143 |
| 1 | q0 | survives outside of water | 0 | 0.47619 |
| 2 | q5 | has stripes | 5 | 0.285714 |
| 3 | q1 | has a fin | 1 | 0.285714 |
| 4 | q3 | is a reptile | 3 | 0.285714 |


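|   | name | q0 | q1 | q2 | q3 | q4 | q5 | q6 | q7 | q8 |
|---|------|----|----|----|----|----|----|----|----|----|
| 0 | shark | 0.0 | 1.0 | 1.0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
| 1 | monkey | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN |
| 2 | whale | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | NaN | NaN | NaN |
| 3 | barracuda | 0.0 | 1.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | snake | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | NaN |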


As you can see, each row of the questions dataframe has:

- id\
(used to link the question to its column in the animals dataframe)
- text\
(the question itself, presented to the user)
- order\
(the insertion order in the table; the index, in contrast, follows the weight ranking)
- weight\
(a simple weight system used to determine the next question to be presented to the user, based on the current dataframe snapshot)

and the animals dataframe has:

- name\
(name of the animal)
- q0 : qX\
(the answer to the question whose id equals the column name; possible values are 0 for no, 1 for yes, and np.nan when there is no info on that animal for that question)
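
To show how the two tables fit together, here is a small sketch that turns one animal's row back into readable facts. The toy frames copy a few rows and columns from the heads printed above; `known_facts` is a hypothetical helper, not part of the Data module.

```python
import numpy as np
import pandas as pd

# Toy copies of a few rows/columns from the dataframes shown above.
questions = pd.DataFrame({
    "id": ["q2", "q0", "q1", "q3"],
    "text": ["is a mammal", "survives outside of water", "has a fin", "is a reptile"],
})
animals = pd.DataFrame({
    "name": ["whale", "monkey", "barracuda"],
    "q0": [0.0, 1.0, 0.0],
    "q1": [1.0, 0.0, 1.0],
    "q2": [1.0, 1.0, 0.0],
    "q3": [0.0, 0.0, np.nan],   # no data yet for the barracuda on q3
})


def known_facts(animals_df: pd.DataFrame, questions_df: pd.DataFrame, name: str) -> dict:
    """Return {question text: True/False} for every question answered for `name`."""
    row = animals_df.loc[animals_df["name"] == name].iloc[0]
    text_by_id = questions_df.set_index("id")["text"]
    facts = {}
    for qid, text in text_by_id.items():
        value = row[qid]
        if pd.notna(value):        # NaN means "unknown", so it is skipped
            facts[text] = bool(value)
    return facts


print(known_facts(animals, questions, "barracuda"))
```

The question id is the only link between the tables: it is a value in the `id` column of one and a column name in the other.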

### Important methods:

There is one method whose logic is worth explaining:

#### calculate_questions_weights:
Let's look at the help for it:

In [8]:
help(data.Data_module.calculate_questions_weights)

Help on function calculate_questions_weights in module data:

calculate_questions_weights(self)
    Changes the order of the questions using a weight system
    (the weight is the percent of animals with data times the difference between the 2 options, so weight correlates with the number of removed options if a question is presented)



and at its code:

    def calculate_questions_weights(self):
        for row in self._questions_df.index:
            aux = self._animals_df.loc[:,self._questions_df.loc[row].id]
            self._questions_df.loc[row,'weight'] = (1 - abs(len(aux.loc[aux == 1]) - len(aux.loc[aux == 0]))/len(aux.dropna())) * (len(aux.dropna())/len(aux))
            self._questions_df = self._questions_df.sort_values('weight', ascending=False).reset_index(drop=True)
            self.save_questions_to_disk()

This method starts to address the first rule:

- The game should be optimized so it tries to get the chosen animal in the least possible number of questions

So the best course of action is to prioritize the questions that will invalidate the most animals in the database no matter the answer.\
With that in mind, there are 2 properties of a question that should be used:

1. the number of animals it has information on
2. how skewed it is towards one of the answers

The percentage of animals that have information on a question is easily calculated using:

    (len(aux.dropna())/len(aux))
    
that is, by dividing the number of non-missing values for the question by the total number of animals in the database.

The second point can be measured with this bit of code:

    (1 - abs(len(aux.loc[aux == 1]) - len(aux.loc[aux == 0]))/len(aux.dropna()))

It measures how close the question is to being perfectly balanced ("as all things should be"): 1 means the same number of animals have 'y' and 'n' as answers to it, and 0 means every animal with data has the same answer.\
Even if a heavily skewed question can pinpoint an animal with only one answer, most of the time it will only invalidate that one animal, and in a game of probability you don't put all your chips on a miracle.
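
As a worked example, the same two factors can be computed on a toy table. This is a standalone sketch, not the module's method: `question_weight` and `rank_questions` are hypothetical names, the column values are made up, and the guard for a question with no data at all is my own addition.

```python
import pandas as pd


def question_weight(answers: pd.Series) -> float:
    """balance * coverage, following the formula discussed above."""
    known = answers.dropna()
    if len(known) == 0:
        return 0.0  # assumption: a question with no data gets the lowest weight
    yes = int((known == 1).sum())
    no = int((known == 0).sum())
    balance = 1 - abs(yes - no) / len(known)   # 1 = perfectly split, 0 = one-sided
    coverage = len(known) / len(answers)       # share of animals with an answer
    return balance * coverage


def rank_questions(animals_df: pd.DataFrame, question_ids) -> pd.Series:
    """Compute every weight first, then sort once, highest weight first."""
    weights = pd.Series({qid: question_weight(animals_df[qid]) for qid in question_ids})
    return weights.sort_values(ascending=False)


nan = float("nan")
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "balanced": [1.0, 1.0, 0.0, 0.0],   # 2 yes / 2 no, full coverage
    "skewed": [1.0, 1.0, 1.0, 0.0],     # 3 yes / 1 no, full coverage
    "sparse": [1.0, 0.0, nan, nan],     # balanced, but only half the animals
})
print(rank_questions(toy, ["balanced", "skewed", "sparse"]))
```

Here `balanced` gets 1.0 while `skewed` and `sparse` both get 0.5: one loses on balance, the other on coverage.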

## The Logic Module

The Logic module implements the game logic and, again, let's start by having a look at its help page:

In [9]:
help(logic.Logic_module)

Help on class Logic_module in module logic:

class Logic_module(builtins.object)
 |  logic components
 |  
 |  Methods defined here:
 |  
 |  __init__(self, ui_module:ui.Ui_module, data_module:data.Data_module)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  get_sup_modules_types(self)
 |      Resturn the suport modules types associated with the logic module
 |      
 |      Returns:
 |          tuple of str : contains id of the ui and data module
 |  
 |  start_execution(self)
 |      Starts the game execution
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)



### Important methods:

But this time, to understand the Logic module we will have to go beyond the help page of the class, because start_execution is a thin wrapper that hides the 4 private methods that do the work. Those are:

- _discover_animal_with_simplified_order
- _animal_not_found
- _animal_found
- _ambiguous_animal_found

To explain them here I added docstrings to all 4 (private methods don't strictly need docstrings). I'm not showing the code, but feel free to look it up in the repository.

#### _discover_animal_with_simplified_order

Let's see its help page:

In [10]:
help(logic.Logic_module._discover_animal_with_simplified_order)

Help on function _discover_animal_with_simplified_order in module logic:

_discover_animal_with_simplified_order(self) -> (<class 'str'>, <class 'dict'>, <class 'int'>)
    Tries to discover the user's animal by process of elimination using a simplified ordering by weight
    The ordering only uses the weights of the questions calculated using the full dataframe, so only the first question for sure at it's best position
    
    Returns:
        str: the name of the animal chosen by the user, only returned if only one value with complete information is found at the end of the elimination process
        dict: a dict with the mapping of values to questions id who set the dataframe to this position
        int: the number of animals that the dataframe has complete information that can be the one chosen by the user



This is a simplified order because it is a simple sort of the weights calculated on the initial dataframes and saved. A complete ordering would recalculate the weights after every question, on the dataframe snapshot generated by that question's answer (or we can just pre-calculate everything and save it in a binary search tree, which is a next step for the project).
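
For comparison, a complete ordering could look roughly like the sketch below, which recomputes the weights on every snapshot. It is not the repository's implementation: `question_weight`, `next_question` and `guess` are hypothetical names, and keeping animals with missing data as candidates is an assumption of mine. The toy rows are copied from the animals table above.

```python
import pandas as pd


def question_weight(answers: pd.Series) -> float:
    """balance * coverage on the current snapshot (0.0 when there is no data)."""
    known = answers.dropna()
    if len(known) == 0:
        return 0.0
    yes = int((known == 1).sum())
    no = int((known == 0).sum())
    return (1 - abs(yes - no) / len(known)) * (len(known) / len(answers))


def next_question(candidates: pd.DataFrame, asked: dict):
    """Pick the unasked question with the highest weight, or None if none can split."""
    best_id, best_weight = None, 0.0
    for qid in candidates.columns.drop("name"):
        if qid in asked:
            continue
        weight = question_weight(candidates[qid])
        if weight > best_weight:
            best_id, best_weight = qid, weight
    return best_id


def guess(animals_df: pd.DataFrame, answer_fn):
    """Eliminate animals until at most one is left or no useful question remains."""
    candidates, asked = animals_df, {}
    while len(candidates) > 1:
        qid = next_question(candidates, asked)
        if qid is None:
            break  # remaining animals cannot be told apart
        asked[qid] = answer_fn(qid)
        column = candidates[qid]
        # Assumption: animals with no data for this question stay in the running.
        candidates = candidates[(column == asked[qid]) | column.isna()]
    return list(candidates["name"]), asked


animals = pd.DataFrame({
    "name": ["whale", "monkey", "barracuda", "snake"],
    "q0": [0.0, 1.0, 0.0, 1.0],
    "q1": [1.0, 0.0, 1.0, 0.0],
    "q2": [1.0, 1.0, 0.0, 0.0],
})
whale_answers = {"q0": 0, "q1": 1, "q2": 1}
print(guess(animals, whale_answers.get))
```

After q0 is answered, q1 no longer separates the two remaining animals, so its weight on that snapshot drops to 0 and the sketch jumps straight to q2. A fixed order computed once on the full table cannot notice that.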

#### _animal_found

Again, you know the drill:

In [11]:
help(logic.Logic_module._animal_found)

Help on function _animal_found in module logic:

_animal_found(self, name:str, answers_dict:dict)
    Resolves the case when the animal was found, or by process of elimination or by input of the user
    
    Args:
        name (str): the name of the animal
        answers_dict (dict): a dict with the mapping of values to questions id who set the dataframe to this position



This is the function called when the game has only one valid animal in the dataframe snapshot and it hits the mark, or when it tried its luck and missed but the animal was in the database and it just didn't have enough info on it to make a correct prediction. Since this function updates the animal with the new data entered by the user while answering the questions, next time it will get it right.

#### _animal_not_found

Help page to the rescue:

In [12]:
help(logic.Logic_module._animal_not_found)

Help on function _animal_not_found in module logic:

_animal_not_found(self, answers_dict:dict)
    Resolves the case when the animal could not be found by process of elimination
    
    Args:
        answers_dict (dict): a dict with the mapping of values to questions id who set the dataframe to this position



This is the function called when the game gives up on finding the animal. It asks the user for the animal's name and updates the database with this new data.\
(It may have lost the game, but it won new data, so in the end did it rly lose?)

#### _ambiguous_animal_found

and for the last time:

In [13]:
help(logic.Logic_module._ambiguous_animal_found)

Help on function _ambiguous_animal_found in module logic:

_ambiguous_animal_found(self, answers_dict:dict, number_of_valids:int)
    Resolves case when more than one animal have the same values, are the only ones remaining and there is no more questions to differenciate then.
    
    Args:
        answers_dict (dict): a dict with the mapping of values to questions id who set the dataframe to this position
        number_of_valids (int): the number of animals with identical values left



This last function is one of a kind: when 2 or more animals are left, but the remaining questions don't have data that can differentiate between them, this is the function to call.\
It first asks which animal the player chose, then gives the player an opportunity to add a new characteristic of this animal, which will in turn become another question; then it adds the question and updates or adds the animal.
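
On the data side, that last step amounts to adding a column and then updating or appending a row. A rough sketch with plain pandas follows. `add_question` and `upsert_animal` are hypothetical helpers (the real Data module exposes `add_question_with_text` and `add_animal_with_dict`, whose internals are not shown here), and numbering new ids as `q<count>` is an assumption.

```python
import numpy as np
import pandas as pd


def add_question(questions_df: pd.DataFrame, animals_df: pd.DataFrame, text: str):
    """Append a question row and a matching all-NaN column in the animals table."""
    new_id = f"q{len(questions_df)}"   # assumption: ids are numbered by insertion
    new_row = pd.DataFrame([{"id": new_id, "text": text,
                             "order": len(questions_df), "weight": 0.0}])
    questions_df = pd.concat([questions_df, new_row], ignore_index=True)
    animals_df = animals_df.copy()
    animals_df[new_id] = np.nan        # nobody has an answer for it yet
    return new_id, questions_df, animals_df


def upsert_animal(animals_df: pd.DataFrame, name: str, answers: dict) -> pd.DataFrame:
    """Update the answers of `name`, adding a new row first if it is unknown."""
    animals_df = animals_df.reset_index(drop=True)
    if not (animals_df["name"] == name).any():
        # Add one empty row at the end, then fill in the name.
        animals_df = animals_df.reindex(range(len(animals_df) + 1))
        animals_df.loc[len(animals_df) - 1, "name"] = name
    mask = animals_df["name"] == name
    for qid, value in answers.items():
        animals_df.loc[mask, qid] = value
    return animals_df
```

Animals that were not part of the game keep NaN in the new column, so the new question starts with low coverage and only gains weight as more games fill it in.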

---
# IT'S ALIVE! AND (kinda) LEARNING! What now?
---

For roughly 10 hours of work on the project, IT WORKS!\
So what can we do now?

The first thing is more training. At the moment the animals dataframe has:

In [14]:
len(animals_df)

21

animals and the questions dataframe has:

In [15]:
len(questions_df)

9

questions, so a long way to go...

But after training a bit, we can get enough data to justify implementing a binary tree to save the weights of every snapshot (or of the first 20 or so, if storage space is a constraint), so we get faster predictions.

And with the data gathered we can even implement question skips based on the correlation coefficients with the already asked questions, for even faster predictions!

In [16]:
display(questions_df[(questions_df.id == 'q0') | (questions_df.id == 'q1')])

animals_df.q0.dropna().corr(animals_df.q1.dropna())

|   | id | text | order | weight |
|---|----|------|-------|--------|
| 1 | q0 | survives outside of water | 0 | 0.47619 |
| 3 | q1 | has a fin | 1 | 0.285714 |


-0.7171371656006367

Because, as we can see, questions q0 and q1 have a Pearson correlation coefficient of -0.717, we now know they have a strong inverse correlation.\
So is it worth asking if the animal has a fin right after it was already answered that it survives outside of water? The weight can be adjusted by that too...

After some training we can get some neat info on which questions and animals the users insert, and get some fun insights, like which mammals in the database are associated with Nordic mythos:
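
One possible way to turn that idea into a skip rule is sketched below. It is only an illustration: `redundant_questions` and the 0.7 threshold are my own assumptions, and the toy values are made up so that q1 is the exact inverse of q0.

```python
import pandas as pd


def redundant_questions(animals_df: pd.DataFrame, asked_ids, threshold: float = 0.7):
    """List unasked questions whose |Pearson r| with an asked one is >= threshold."""
    # DataFrame.corr() works pairwise, ignoring rows where either value is missing.
    corr = animals_df.drop(columns="name").corr()
    redundant = set()
    for asked in asked_ids:
        strong = corr.index[corr[asked].abs() >= threshold]
        redundant.update(qid for qid in strong if qid not in asked_ids)
    return sorted(redundant)


toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "q0": [0.0, 1.0, 0.0, 1.0],
    "q1": [1.0, 0.0, 1.0, 0.0],   # exact inverse of q0
    "q2": [1.0, 1.0, 0.0, 0.0],   # unrelated to q0
})
print(redundant_questions(toy, ["q0"]))
```

A softer variant would scale the weight of the correlated question down instead of skipping it outright, which keeps it available when the data is still thin.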

In [17]:
animals_df.loc[(animals_df.q2 == 1) & (animals_df.q7 == 1), 'name']

11     wolf
19     goat
20    horse
Name: name, dtype: object

In the end:\
the possibilities are endless, but my time unfortunately is not. There are other projects to do and more to learn. Sometime later I will come back and revisit this one, but for now this was a good quick exercise.

If you find this project and want to use the Python code in here, it's yours.